Abstract: WiMAX (Worldwide Interoperability for
Microwave Access) technology has emerged in response to
the increasing demand for multimedia services in
broadband Internet networks. The WiMAX standard
defines five scheduling services to meet the QoS
(Quality of Service) requirements of multimedia
applications, and this paper investigates one of them,
UGS scheduling. Unlike traditional quality assessment
approaches, current research centers on the user's
perception of quality. Existing scheduling approaches
take into account QoS, mobility and many other
parameters, but do not consider the Quality of
Experience (QoE). This paper proposes a scheduling
approach that controls the packet transmission rate so
as to match the minimum subjective rate requirement of
each user and thereby reduce packet loss and delay. The
solution has been implemented and evaluated in a WiMAX
simulation platform based on NS-2. Simulation results
show that, by applying various levels of MOS (Mean
Opinion Score), the QoE provided to the users is
enhanced in terms of jitter, packet loss rate,
throughput and delay.

Keywords: WiMAX, QoE, QoS, UGS, NS-2.

I. INTRODUCTION 

Traditionally, a network has been assessed
objectively by measuring a set of parameters that
characterize its service quality. This evaluation is
known as the QoS of the network. The term QoS refers to
the guarantees on the ability of a network to deliver
predictable results and a more deterministic
performance, so that data can be transferred with
minimum delay, packet loss and jitter and with maximum
throughput. QoS does not take into account the
user's perception of quality. Another approach, which
does take it into account, is QoE: the overall
acceptability of an application or service, as
perceived subjectively by the end user. It groups
together user perception, expectations, and
experience of application and network performance.



In order to get a more comprehensive view of the
quality perceived by end users, QoE has become an
increasingly active area of research. Many related
works have been presented on analyzing and
improving QoE [12] in WiMAX networks. The study
presented in [14] suggested a method for estimating
QoE metrics from QoS metrics in a WiMAX
network. The QoE was estimated using a multilayer
Artificial Neural Network (ANN). The results show an
efficient estimation of QoE metrics with respect to
QoS parameters.

In [6, 7, 8], the authors focus on the ANN method to
adjust the input network parameters so as to obtain the
output that best satisfies the end users. In particular,
the success of the ANN approach depends on the model's
capacity to fully learn the nonlinear interactions
between QoE and QoS. In [16], Muntean proposes a learner
QoE model that considers delivery performance-based
content personalization in order to improve user
experience when interacting with an online learning
system. Simulation results show significant
improvements in terms of learning achievement,
learning performance, learner navigation and user QoE.

In [3], we studied and analyzed the QoS performance of
VoIP traffic using different service classes in terms
of throughput, jitter and delay. The simulation results
show that the UGS service class is the best suited to
handle VoIP traffic. This paper proposes a QoE-based
model in order to provide the best performance in a
WiMAX network, especially for real-time traffic. The
target of this improvement is the scheduling of UGS
service class traffic.

The rest of this paper is organized as follows. A short
description of WiMAX technology is given in section 2.
In section 3, an overview of QoE is presented.
The proposed QoE model is described in detail in
section 4. The simulation environment and performance
parameters are presented in section 5. Section 6 shows
simulation results and analysis. Finally, section 7
concludes the paper.

II. WIMAX TECHNOLOGY

WiMAX is a wireless communication standard based
on the 802.16 standards [10, 11]. The main objective of
WiMAX is to provide a broadband Internet connection
to a coverage area with a radius of several kilometers.
Unlike ADSL (Asymmetric Digital Subscriber Line) or
other wired technologies, WiMAX uses radio waves,
similar to those used for mobile phones.

WiMAX can be used in point-to-multipoint (PMP) 
mode in which serving multiple client terminals is 
ensured from a central base station, and in point-to- 
point (P2P) mode, in which there is a direct link 
between the central base station and the subscriber. 

PMP mode is less expensive to implement and operate 
while P2P mode can provide greater bandwidth. 

A. QoS in WiMAX Network 

Since QoS support is an important part of a WiMAX
network, the concept of QoS was introduced natively in
WiMAX [18], so that the protocol can ensure the proper
operation of a service. Some services are very
demanding: VoIP, for example, cannot tolerate delay in
the transmission of data. WiMAX uses service classes to
allow a different QoS for each communication.

The concept of QoS mainly depends on the service
provided, its sensitivity to transmission errors, its
response time requirements, and so on. For VoIP traffic,
one of the challenges is related to network congestion
and latency: real-time traffic transfer is needed,
with very low latency and low jitter. A complete
definition of QoS often refers to the mode of transport
of information, although the solution adopted by the
network to provide the service must remain transparent
to the user.

Satisfying QoS requirements is imperative in
IEEE 802.16 systems in order to provide the best
performance, in particular in the presence of various
types of connections, namely current calls, new calls
and handoff connections.

B. WiMAX Network Architecture 

The architecture of a WiMAX network consists of a
base station, named BTS (Base Transceiver Station) or
BS (Base Station), and mobile clients or stations (SS,
Subscriber Station). The base station acts as a central
antenna responsible for communicating with and serving
the mobile stations, which in turn serve clients using
Wi-Fi or ADSL. The BS can provide various levels of QoS
through its queuing, scheduling, control, signaling,
classification and routing mechanisms. Figure 1 shows
the architecture of a WiMAX network [10, 11].



Figure 1 : WiMAX Network Architecture
(Diagram labels. Subscriber Station Node: Application; Connection
Classification; UGS, rtPS, nrtPS, BE; Modulation, Scheduling, Routing.
Base Station Node: Admission Control; Uplink Packet Scheduling for UGS
Service Flow, defined by IEEE 802.16; Demodulation; Packet Scheduling
for rtPS, BE and nrtPS, undefined by IEEE 802.16.)



C. Different Service Classes in WiMAX 

Multiple kinds of traffic are considered in WiMAX.
QoS is negotiated per service flow, at the
establishment of the connection, when a modulation and
coding scheme is also set up. To satisfy different
types of applications, the WiMAX standard has defined
four service classes, namely Unsolicited Grant Service
(UGS), Best Effort (BE), real-time Polling Service
(rtPS) and non-real-time Polling Service (nrtPS). The
mobility amendment to the IEEE 802.16 standard
(802.16e-2005) [1] includes a fifth service
class, the extended real-time Polling Service (ertPS).
This service is placed between the UGS service and
the rtPS service. It can serve real-time applications
that generate periodic packets of variable size; the
example given in the standard is that of a VoIP
application with silence suppression.

Some services, such as VoIP, are very demanding in terms
of QoS and cannot tolerate delay in data transmission,
while others have fewer requirements.

Table I lists the service classes of WiMAX
and gives their description and QoS parameters.

Table I. Service classes in WiMAX

Service | Description | QoS parameters
--------+-------------+---------------
UGS     | Real-time data streams comprising fixed-size data packets at periodic intervals | Maximum Sustained Rate; Maximum Latency Tolerance; Jitter Tolerance
rtPS    | Support for real-time service flows that periodically generate variable-size data packets | Traffic Priority; Maximum Latency Tolerance; Maximum Reserved Rate
ertPS   | Real-time service flows that generate variable-sized data packets on a periodic basis | Minimum Reserved Rate; Maximum Sustained Rate; Maximum Latency Tolerance; Jitter Tolerance; Traffic Priority
nrtPS   | Support for non-real-time services that require variable-size data grants on a regular basis | Traffic Priority; Maximum Reserved Rate; Maximum Sustained Rate
BE      | Data streams for which no minimum service level is required | Maximum Sustained Rate; Traffic Priority

http://sites.google.com/site/ijcsis/
ISSN 1947-5500

(IJCSIS) International Journal of Computer Science and Information Security,
Vol. 12, No. 9, September 2014



III. QUALITY OF EXPERIENCE 

Quality of Experience (QoE, user Quality of 
Experience or simply QX) is a subjective measure that 
reflects the user satisfaction with the service provided 
(web browsing, phone call, TV broadcast, call to a Call 
Center). 

Today, assessing the quality of experience has become 
essential for service providers and content providers. 

A. Quality of Experience vs Quality of Service 
assessment 

The term QoS appeared in the 1990s to describe the
quality of the network. Since then, the acronym QoS has
usually been used to describe the improved
performance achieved by hardware and/or software.
But with the rapid development of media services, this
measure has shown its limitations, and many efforts
have been made to develop a new metric that reflects
more accurately the quality of the service provided.
This measure is called the QoE.



QoE is a subjective measure of a customer's experience
with a service according to his or her perception. The
notion of user experience was first introduced by
Dr. Donald Norman, who stressed the importance
of designing user-centered services [17].
Gulliver and Ghinea [9] classify QoE into three
components: assimilation, judgment and satisfaction.
Assimilation is an assessment of the clarity of
the content from an informative point of view.
Judgment reflects the quality of presentation.
Satisfaction indicates the user's degree of overall
appreciation.

QoE and QoS have become complementary concepts:
QoS indicators are used to identify and analyze the
causes of network congestion, while QoE indicators are
used to monitor the quality offered to users. Used in
parallel, the two provide a complete monitoring
system.

B. QoE measurement approaches

Two main quality evaluation methodologies are
defined, namely objective and subjective performance
evaluation. Subjective assessments are carried out by
end users who are asked to evaluate the overall
perceived quality of the service provided. The most
frequently used measurement is the MOS recommended
by the International Telecommunication Union (ITU)
[13], which is defined as a numeric score
from 1 to 5 (i.e. poor to excellent).

Objective methods are centered on algorithms and on
mathematical and/or comparative techniques that
generate a quantitative measure of the service provided.

Peter and Bjørn [5] classified the existing approaches to
measuring network service quality from a user
perspective into three categories, namely Testing
User-perceived QoS (TUQ), Surveying Subjective QoE
(SSQ) and Modeling Media Quality (MMQ). The first
two approaches collect subjective information from
users, whereas the third approach is based on objective
technical assessment. Figure 2 [2] gives an overview of
the classification of the existing approaches.

Figure 2. The approaches for measuring network service quality from
a user perception
(Diagram labels. Measuring approaches from a user perspective: Testing
User-perceived QoS (TUQ); Surveying Subjective QoE (SSQ); Modelling
Media Quality (MMQ). Examples shown: questionnaires; Perceptual
Evaluation of Speech Quality (PESQ).)



IV. QoE-BASED SCHEDULING ALGORITHM 
MODEL 

In this section, we propose a QoE-based scheduling
approach for WiMAX networks. Existing scheduling
algorithms take into account QoS but not the user's
perception of the service provided, even though every
user has different subjective requirements of the
system.

A. Proposed QoE model 

In the proposed QoE-based model, three QoE levels
are used: each user has an initial maximum transmission
rate, a minimum subjective rate requirement and a
subjective threshold value. The traffic of each user
starts at the maximum transmission rate. When the
packet loss rate is greater than the threshold selected
by the user (chosen at the beginning of the
simulation), the user checks whether the transmission
rate is higher than the minimum subjective requirement.
If so, the transmission rate is decreased; otherwise it
remains at the same level.

On the other hand, if the packet loss rate is less than
the selected threshold, the user checks whether the
transmission rate is lower than the minimum subjective
requirement. If so, the transmission rate is increased;
otherwise it remains at the same level.

The threshold can be selected by the user as a
percentage of the data transmission rate. For example,
if the user enters a value of 50 as the threshold, then
the threshold for the packet loss rate is 50%. Figure 3
shows the activity diagram of the proposed model.




Figure 3: Activity diagram of the proposed QoE-Model 
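
The decision logic described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' NS-2 implementation: the function name `adjust_rate`, the fixed `step` size and the clamping of the rate at the minimum requirement are assumptions, since the paper does not state by how much the rate changes at each adjustment.

```python
def adjust_rate(rate, min_rate, loss_rate, threshold, step):
    """Return the next transmission rate for one user.

    rate, min_rate and step share one unit; loss_rate and threshold are
    fractions in [0, 1]. `step` is a hypothetical parameter: the paper
    does not say how large each rate adjustment is.
    """
    if loss_rate > threshold:
        # Loss above the user's threshold: back off, but (assumption)
        # never go below the user's minimum subjective requirement.
        if rate > min_rate:
            return max(min_rate, rate - step)
        return rate
    if loss_rate < threshold:
        # Loss below the threshold: raise the rate only if it is
        # currently below the minimum subjective requirement.
        if rate < min_rate:
            return min(min_rate, rate + step)
        return rate
    return rate  # loss exactly at the threshold: keep the rate


# A user that starts at 200 and requires at least 150, threshold 20%.
rate = 200.0
for loss in (0.35, 0.30, 0.25, 0.05):
    rate = adjust_rate(rate, min_rate=150.0, loss_rate=loss,
                       threshold=0.20, step=20.0)
print(rate)
```

With these assumed values the rate falls from 200 to the minimum requirement and then stays there once the loss drops below the threshold, which is the behaviour the model describes.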



V. SIMULATION ENVIRONMENT

A. Simulation Model 

In this paper, we evaluate the performance of the
proposed QoE-based scheduling algorithm. We
consider the Wireless-OFDM PHY layer, and our QoE
model is evaluated and compared with the popular
WiMAX module developed by NIST (National Institute
of Standards and Technology), which is based on the
IEEE 802.16 standard (802.16-2004) and the mobility
extension (802.16e-2005) [19]. The Network Simulator
(NS-2) [15] is used. Our simulation scenario
consists of five wireless users connected to a
base station (BS). A sink node is created and attached
to the base station to accept packets. A traffic agent
is created and then attached to the source node.

Finally, we set the traffic produced by each node.
Every node runs a CBR (Constant Bit Rate) source with a
packet size of 200 bytes. The packet interval is 0.0015
for the first and fifth nodes and 0.001 for the second,
third and fourth nodes. The initial transmission rates
produced by the five nodes are therefore about
133.3 Kbps, 200 Kbps, 200 Kbps, 200 Kbps and
133.3 Kbps respectively. All nodes have the same
priority.

Each user has a minimum requirement: the first and
fifth users require a minimum traffic rate of 120 Kbps,
and the second, third and fourth users require
150 Kbps.

Table II summarizes the produced and required traffic
rate of each user.



Table II. Users' traffic parameters

User   | Initial traffic rate (Kbps)  | User minimum requirement (Kbps)
-------+------------------------------+--------------------------------
User 1 | 133.33 (200 bytes / 0.0015)  | 120
User 2 | 200 (200 bytes / 0.001)      | 150
User 3 | 200 (200 bytes / 0.001)      | 150
User 4 | 200 (200 bytes / 0.001)      | 150
User 5 | 133.33 (200 bytes / 0.0015)  | 120


We use five different thresholds 10%, 20%, 30%, 40% 
and 50%. 
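
The initial rates in Table II follow directly from dividing the packet size by the packet interval. The short sketch below reproduces that arithmetic; it assumes the interval is expressed in seconds (the paper gives no unit), and the helper name `cbr_rate` is hypothetical. Dividing 200 bytes by the interval gives thousands of bytes per second, which matches the 133.33 and 200 figures that the paper labels "Kbps"; the same traffic expressed in kilobits per second is eight times larger.

```python
PACKET_SIZE_BYTES = 200  # packet size used for every user


def cbr_rate(packet_size_bytes, interval_s):
    """Return (kilobytes per second, kilobits per second) of a CBR source.

    Assumes one packet of `packet_size_bytes` is sent every `interval_s`
    seconds.
    """
    bytes_per_s = packet_size_bytes / interval_s
    return bytes_per_s / 1000.0, bytes_per_s * 8 / 1000.0


intervals = (0.0015, 0.001, 0.001, 0.001, 0.0015)  # users 1 to 5
for user, interval in enumerate(intervals, start=1):
    kbytes, kbits = cbr_rate(PACKET_SIZE_BYTES, interval)
    print(f"User {user}: {kbytes:.2f} kB/s = {kbits:.2f} kbit/s")
```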

We have used the QoS-included WiMAX module
[4] within NS-2.29. This module is based on the NIST
implementation of WiMAX [19]. It includes the QoS
classes as well as the management of the QoS
requirements, unicast and contention request
opportunity mechanisms, and scheduling algorithms
for the UGS, rtPS and BE QoS classes.

The resulting trace files are parsed and analyzed
with a Perl script, which extracts the data needed to
compute throughput, packet loss rate, jitter and delay.
The extracted results are plotted as graphs using
Excel.

B. Simulation Parameters 

The same simulation parameters are used for both the
NIST and the QoE-based scheduling algorithms. Table III
summarizes the simulation parameters:



Table III. Simulation parameters

Parameter                     | Value
------------------------------+-------------------------------
Network interface type        | Phy/WirelessPhy/OFDM
Propagation model             | Propagation/OFDM
MAC type                      | Mac/802.16/BS
Antenna model                 | Antenna/OmniAntenna
Service class                 | UGS
Packet size                   | 200 bytes
Frequency bandwidth           | 5 MHz
Receive Power Threshold       | 2.025e-12
Carrier Sense Power Threshold | 0.9 * Receive Power Threshold
Channel                       | 3.486e+9
Simulation time               | 200 s



C. Performance Parameters 

Main QoS parameters were analyzed in our simulation, 
namely average throughput, packet loss rate, average 
jitter and average delay. 
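
As an illustration of how these four parameters can be derived from per-packet records, the following sketch computes them in Python. It is not the authors' Perl script and does not parse the NS-2 trace format: the record layout `(send_time, receive_time or None, size)` and the function name `flow_metrics` are assumptions, and jitter is taken here as the mean absolute difference between the delays of consecutive received packets, which is one common definition among several.

```python
from statistics import mean


def flow_metrics(packets, duration_s):
    """Compute the four performance parameters for one flow.

    packets: non-empty list of (send_time_s, recv_time_s or None,
    size_bytes); recv_time_s is None when the packet was lost.
    """
    received = [(s, r, n) for s, r, n in packets if r is not None]
    delays = [r - s for s, r, _ in received]
    # Jitter (assumed definition): mean absolute change between the
    # delays of consecutive received packets.
    if len(delays) > 1:
        jitter = mean(abs(b - a) for a, b in zip(delays, delays[1:]))
    else:
        jitter = 0.0
    return {
        "throughput_bps": 8 * sum(n for _, _, n in received) / duration_s,
        "loss_rate": 1 - len(received) / len(packets),
        "avg_delay_s": mean(delays) if delays else 0.0,
        "avg_jitter_s": jitter,
    }


# Four 200-byte packets, the third one lost.
sample = [(0.000, 0.010, 200), (0.001, 0.013, 200),
          (0.002, None, 200), (0.003, 0.012, 200)]
print(flow_metrics(sample, duration_s=0.004))
```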

VI. SIMULATION RESULTS AND ANALYSIS 

We have performed various simulation scenarios in
order to analyze and compare the proposed QoE-based
scheduler with the NIST scheduler in terms of average
throughput, packet loss rate, average delay and average
jitter in a WiMAX network using the UGS service class.

In Figure 5, we note that the average throughput with
the QoE-based scheduler is lower than with the NIST
scheduler for all flows, and that the third flow has
the largest range between maximum and minimum values.

For flows 2 and 4, the throughput values are similar
for both the NIST scheduler and the QoE-based
scheduler, especially when the QoE threshold is 50%.
The QoE-aware scheduler varies the throughput of the
different users so as to match the minimum subjective
rate requirement of each user, in order to reduce
jitter, delay and packet loss.




Figure 5. Average throughput
(Legend: NIST scheduler; QoE threshold 10%, 20%, 30%, 40%, 50%.)



As shown in Figure 6, the improvement is noticeable
when the QoE-based scheduler is used. The packet loss
rate is reduced across users, although for flows 3 and 5
it is similar for both schedulers. Overall, the NIST
scheduler performs worse than the QoE-based scheduler
in terms of packet loss rate.




Figure 6. Packet loss rate
(Legend: NIST scheduler; QoE threshold 10%, 20%, 30%, 40%, 50%.)



It can be observed from Figure 7 that the proposed
QoE-based scheduler has lower values of average jitter
than the NIST scheduler at the different threshold
levels, especially for flows 1, 2 and 3. Average jitter
values are identical for flows 4 and 5 at all threshold
levels.



Figure 7. Average Jitter
(Legend: NIST scheduler; QoE threshold 10%, 20%, 30%, 40%, 50%.)

As shown in Figure 8, the QoE-based scheduler
outperforms the NIST scheduler: the average packet
transmission delay remains lowest with the QoE-based
scheduler, while the two schedulers have similar values
for flows 4 and 5.




Figure 8. Average Delay
(Legend: NIST scheduler; QoE threshold 10%, 20%, 30%, 40%, 50%.)

VII. CONCLUSION

In this paper, we have proposed a new QoE-based
scheduler that manages the packet transmission
rate of users in a WiMAX network. When the packet
loss rate exceeds the selected threshold, there are two
cases: if the packet transmission rate is not above the
minimum subjective rate requirement, the user
continues to transmit at the same rate; otherwise the
user reduces it.

The simulations carried out show that the use of
different levels of MOS improves the QoE provided to
users of a WiMAX network. The proposed QoE model
significantly reduces packet loss, delay and jitter:
the transmission rate of each connection is reduced
until it matches its minimum subjective rate
requirement.

As future work, we may extend this study by adding
other parameters such as mobility models.

Abstract: Extracting fuzzy patterns from temporal datasets is
an interesting data mining problem. An example of such a
pattern is a yearly fuzzy pattern, where a pattern holds in a
certain fuzzy time interval of every year. It involves finding
frequent sets and then association rules that hold in certain
fuzzy time intervals (late summer, early winter, etc.) in every
year. In most previous works, the fuzziness is user-specified.
However, in some applications the user may not have
enough prior knowledge about the datasets under
consideration and may miss some fuzziness associated with
the problem. It may also be the case that the user is unable to
specify the fuzziness due to the limitations of natural language.
In this paper, we propose a method of extracting patterns that
hold in certain fuzzy time intervals of every year, where the
fuzzy time interval is generated by the method itself. The
efficacy of the method is demonstrated with experimental results.

Keywords: Frequent itemsets, Superimposition of time
intervals, Fuzzy time intervals, Right reference functions, Left
reference functions, Membership functions.

I. Introduction 

Among the various types of data mining applications,
analysis of transactional data has been considered important. It
is assumed that the dataset keeps information about users'
transactions. In a market-basket dataset, each transaction is a
collection of items bought by a customer at one time. The
notion proposed in [1] is to capture the co-occurrence of items
in transactions, given two percentage parameters as minimum
support and minimum confidence thresholds. One important
extension of the above-mentioned problem is to include a time
attribute. When a customer buys something, the transaction and
its time are automatically recorded. In [2], Ale et
al. proposed a method of discovering association rules that
hold within the life-span of the corresponding item set and not
within the life-span of the whole dataset.

In [3] the concept of locally frequent item sets has been
proposed. These are itemsets that are frequent in certain time
intervals and may or may not be frequent throughout the
life-span of the item set. In [3] an algorithm has been proposed for
finding such itemsets along with a list of sequences of time
intervals. Here each frequent itemset is associated with a
sequence of time intervals in which it is frequent. Considering the
time-stamps as calendar dates, a method is discussed in [4]
which can extract yearly, monthly and daily periodic or
partially periodic patterns. If the periods are kept in a compact
manner using the method discussed in [4], the result turns out
to be a fuzzy time interval. In [4], the author imposes the
restriction that the intervals to be superimposed must overlap
to a certain specified extent. In this paper, we discuss such
patterns and devise algorithms for extracting them. The
algorithm can also be applied for extracting monthly or daily
fuzzy patterns. Although our algorithm is quite similar to the
algorithm discussed in [4], in our case we remove the
restriction on the intervals to be superimposed. We
superimpose all the intervals that have a non-empty intersection,
which turn out to be fuzzy intervals. The paper is organized as
follows. In section II, we discuss related works. In section III,
we discuss terms, definitions and notations used in the
algorithm. In section IV, the proposed algorithm is discussed
along with its complexity. In section V, we discuss results
and analysis. Finally, a summary and lines for future work are
given in section VI.

II. RELATED WORKS 

The problem of discovery of association rules was first
formulated by Agrawal et al. [1]. Given a set I of items and a
large collection D of transactions involving the items, the
problem is to find relationships among the presence of various
items in the transactions.
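
To make the two thresholds used in this formulation concrete, the sketch below computes the support of an itemset and the confidence of a rule over a small, made-up list of transactions. The function names and the sample baskets are illustrative assumptions and do not come from [1].

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]
print(support(baskets, {"bread", "milk"}))       # 2 of the 4 baskets
print(confidence(baskets, {"bread"}, {"milk"}))  # 2 of the 3 bread baskets
```

A rule such as bread => milk is reported only if its support and confidence both reach the user-given minimum thresholds.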

Temporal Data Mining [5] is an important extension of
conventional data mining. By taking into account the time
aspect, more interesting patterns that are time dependent can
be extracted. The association rule discovery process is also
extended to incorporate temporal aspects. The associated
problems are to find valid time periods during which
association rules hold and to discover possible
periodicities of association rules. In [2] an algorithm for
the discovery of temporal association rules is described. For
each item (which is extended to item set) a lifetime or
life-span is defined, which is the time gap between the first
occurrence and the last occurrence of the item in the
transactions of the database. The support of an item is calculated
only during its life-span. Thus each rule has a time frame
associated with it. In [3], the work done in [2] has been extended
by taking into account the time gap between two consecutive
transactions containing an item set.
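
The life-span idea can be illustrated with a short sketch. It follows the description above (support counted only between the first and the last occurrence of the itemset) but is not the algorithm of [2]; the function name, the record layout `(timestamp, items)` and the sample data are assumptions.

```python
def lifespan_support(transactions, itemset):
    """Support of `itemset` measured only within its own life-span.

    transactions: list of (timestamp, set_of_items) sorted by timestamp.
    Returns (support, (first_occurrence, last_occurrence)), or
    (0.0, None) if the itemset never occurs.
    """
    itemset = set(itemset)
    hits = [t for t, items in transactions if itemset <= items]
    if not hits:
        return 0.0, None
    first, last = hits[0], hits[-1]
    in_span = [items for t, items in transactions if first <= t <= last]
    return len(hits) / len(in_span), (first, last)


data = [(1, {"a"}), (2, {"a"}), (3, {"b", "c"}),
        (4, {"a"}), (5, {"b", "c"}), (6, {"a"})]
print(lifespan_support(data, {"b", "c"}))
```

In this sample the itemset {b, c} appears in 2 of the 3 transactions of its life-span (timestamps 3 to 5), whereas over the whole dataset it appears in only 2 of 6.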

Considering the periodic nature of patterns, Ozden [6]
proposed a method which is able to find patterns having a
periodic nature, where the period has to be specified by the user.
In [7], Li et al. discuss a method of extracting
temporal association rules with respect to fuzzy match, i.e.
association rules holding during "enough" of the intervals
given by the corresponding calendar pattern. Similar work
was done in [8], incorporating multiple granularities of time
intervals (e.g. first working day of every month) from which
both cyclic and user-defined calendar patterns can be obtained.

Mining fuzzy patterns from datasets has been studied by
different authors. In [9], the authors present an algorithm for
mining fuzzy temporal patterns from a given process instance.
Similar work is done in [10]. In [11] a method of extracting
fuzzy periodic association rules is discussed.

III. TERMS, DEFINITIONS AND NOTATIONS USED

Let us review some definitions and notations used in this 
paper. 

Let $E$ be the universe of discourse. A fuzzy set $A$ in $E$ is
characterized by a membership function $A(x)$ lying in $[0, 1]$.
$A(x)$ for $x \in E$ represents the grade of membership of $x$ in $A$.
Thus a fuzzy set $A$ is defined as

$$A = \{(x, A(x)),\ x \in E\}$$

A fuzzy set $A$ is said to be normal if $A(x) = 1$ for at least one
$x \in E$.

A fuzzy number is a convex normalized fuzzy set $A$ defined
on the real line $\mathbb{R}$ such that

1. there exists an $x_0 \in \mathbb{R}$ such that $A(x_0) = 1$, and

2. $A(x)$ is piecewise continuous.

Thus a fuzzy number can be thought of as containing the real 
numbers within some interval to varying degrees. 

Fuzzy intervals are special fuzzy numbers satisfying the
following.

1. there exists an interval $[a, b] \subset \mathbb{R}$ such that
$A(x_0) = 1$ for all $x_0 \in [a, b]$, and

2. $A(x)$ is piecewise continuous.

A fuzzy interval can be thought of as a fuzzy number with a flat
region. A fuzzy interval $A$ is denoted by $A = [a, b, c, d]$ with
$a < b < c < d$, where $A(a) = A(d) = 0$ and $A(x) = 1$ for all
$x \in [b, c]$. $A(x)$ for $x \in [a, b]$ is known as the left
reference function and $A(x)$ for $x \in [c, d]$ is known as the
right reference function. The left reference function is
non-decreasing and the right reference function is non-increasing.
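
As a concrete illustration of this definition, the sketch below builds the membership function of a fuzzy interval A = [a, b, c, d]. The definition only requires the reference functions to be monotone; the linear (trapezoidal) shape used here is an assumption made for simplicity, and the function name `fuzzy_interval` is hypothetical.

```python
def fuzzy_interval(a, b, c, d):
    """Membership function of A = [a, b, c, d] with a < b <= c < d.

    Assumption: both reference functions are linear, which gives the
    familiar trapezoidal shape.
    """
    def membership(x):
        if x <= a or x >= d:
            return 0.0                    # outside the support
        if x < b:
            return (x - a) / (b - a)      # left reference function
        if x <= c:
            return 1.0                    # flat region [b, c]
        return (d - x) / (d - c)          # right reference function
    return membership


A = fuzzy_interval(1, 3, 5, 7)
print([A(x) for x in (1, 2, 3, 4, 6, 7)])
```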

The support of a fuzzy set $A$ within a universal set $E$ is the
crisp set that contains all the elements of $E$ that have non-zero
membership grades in $A$, and is denoted by $S(A)$. Thus

$$S(A) = \{x \in E : A(x) > 0\}$$

The core of a fuzzy set A within a universal set E is the crisp 
set that contains all the elements of E having membership 
grades 1 in A. 

Set Superimposition 

When we overwrite, the overwritten portion looks darker
for obvious reasons. The set operation union does not explain
this phenomenon. After all,

$$A \cup B = (A - B) \cup (A \cap B) \cup (B - A)$$

and in $(A \cap B)$ the elements are represented once only.

In [13] an operation called superimposition, denoted by (S), was
proposed. If $A$ is superimposed over $B$ or $B$ is superimposed
over $A$, we have

$$A \,(S)\, B = (A - B) \,(+)\, (A \cap B)^{(2)} \,(+)\, (B - A) \tag{1}$$

where $(A \cap B)^{(2)}$ are the elements of $(A \cap B)$ represented
twice, and (+) represents union of disjoint sets.

To explain this, consider an example.

If $A = [a_1, b_1]$ and $B = [a_2, b_2]$ are two real intervals such
that $A \cap B \neq \emptyset$, we get a superimposed portion. It can
be seen from (1) that

$$[a_1, b_1] \,(S)\, [a_2, b_2] = [a_{(1)}, a_{(2)}) \,(+)\, [a_{(2)}, b_{(1)}]^{(2)} \,(+)\, (b_{(1)}, b_{(2)}] \tag{2}$$

where $a_{(1)} = \min(a_1, a_2)$, $a_{(2)} = \max(a_1, a_2)$,
$b_{(1)} = \min(b_1, b_2)$, and $b_{(2)} = \max(b_1, b_2)$.

(2) explains why, if two line segments are superimposed, the
common portion looks doubly dark [5]. The identity (2) is
called the fundamental identity of superimposition of intervals.

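Now let $[a_1, b_1]^{(1/2)}$ and $[a_2, b_2]^{(1/2)}$ be two fuzzy
sets with constant membership value 1/2 everywhere (i.e. equi-fuzzy
intervals with membership value 1/2). If
$[a_1, b_1] \cap [a_2, b_2] \neq \emptyset$, then applying (2) to the
two equi-fuzzy intervals we can write

$$[a_1, b_1]^{(1/2)} \,(S)\, [a_2, b_2]^{(1/2)} = [a_{(1)}, a_{(2)})^{(1/2)} \,(+)\, [a_{(2)}, b_{(1)}]^{(1)} \,(+)\, (b_{(1)}, b_{(2)}]^{(1/2)} \tag{3}$$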

To explain this we take the fuzzy intervals $[1, 5]^{(1/2)}$ and
$[3, 7]^{(1/2)}$ with constant membership value 1/2, given in
Fig-1.1 and Fig-1.2. Here $[1, 5] \cap [3, 7] = [3, 5] \neq \emptyset$.

Fig-1.1 and Fig-1.2 (the two equi-fuzzy intervals, each drawn at
membership level 1/2)

Fig-1.3 Superimposed interval
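
The superimposition of two overlapping equi-fuzzy intervals, as given by identities (2) and (3), can be sketched as follows. The function name `superimpose_pair` and the tuple-based return format are assumptions made for illustration; for brevity the sketch does not record which end points are open and which are closed.

```python
def superimpose_pair(i1, i2, level=0.5):
    """Superimpose two overlapping closed intervals of equal membership.

    Returns three pieces (lower end, upper end, membership value): the
    two non-overlapping parts keep `level`, and the common part gets
    2 * level because its elements are represented twice.
    """
    (a1, b1), (a2, b2) = i1, i2
    lo1, lo2 = min(a1, a2), max(a1, a2)   # a_(1), a_(2)
    hi1, hi2 = min(b1, b2), max(b1, b2)   # b_(1), b_(2)
    if lo2 > hi1:
        raise ValueError("intervals do not intersect")
    return [(lo1, lo2, level), (lo2, hi1, 2 * level), (hi1, hi2, level)]


print(superimpose_pair((1, 5), (3, 7)))
```

For the intervals [1, 5] and [3, 7] the three pieces are [1, 3) and (5, 7] at membership 1/2 and [3, 5] at membership 1.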

If we apply superimposition on the intervals then the 
superimposed interval will be consisting of [1, 3) (1/2) , [3, 5] (1) 
and (5, 7] (1/2) . Here the membership of [3, 5] is (1) due to 
double representation and it is shown in figure- 1.3 
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Let [X[, yi], i=l,2,. . .,n, be n real intervals such that [x. , y. ] 
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probability distribution functions followed by X k and Y k 



^ Generalizing (3) we get 

[x u Jl ] (1/n) (S) [x 2 , y 2 f /n) (S) ... (S) [x n , y n ] (] 



=[x (1) ,X (2) ) (m (+)[x (2) ,X (3 )) (2/n) (+). 



.(+)[x (r) ,x (r+1) ) (r/n) 
(+) ... (+) [x (n) , j(i)] (1) (+)Cy ( i),j(2)] ((n " 1)/n) (+) - (+) (j(„- r) ,j(„- r+ i)] (r/n) 
(+)...(+)( J(n . 2) ,j (n . 1) ] (2/n) (+)(j (n . 1) ,j (n) ] (1/n) 



(4) 



In (4), the sequence {x^} is formed by sorting the sequence 
{xi} in ascending order of magnitude for i=l,2,...w and 
similarly {y (i) } is formed by sorting the sequence {yi} in 
ascending order. 

Although the set superimposition is operated on the closed 
intervals, it can be extended to operate on the open and the 
half-open intervals in the trivial way. 

Lemma 1. (The Glivenko-Cantelli Lemma of Order 
Statistics) 

Let X = (X_1, X_2, ..., X_n) and Y = (Y_1, Y_2, ..., Y_n) be two random
vectors, and (x_1, x_2, ..., x_n) and (y_1, y_2, ..., y_n) be two particular
realizations of X and Y respectively. Assume that the sub-σ
fields induced by X_k, k = 1, 2, ..., n are identical and
independent. Similarly assume that the sub-σ fields induced by
Y_k, k = 1, 2, ..., n are also identical and independent. Let
x_{(1)}, x_{(2)}, ..., x_{(n)} be the values of x_1, x_2, ..., x_n, and
y_{(1)}, y_{(2)}, ..., y_{(n)} be the values of y_1, y_2, ..., y_n,
arranged in ascending order.

For X and Y, let the empirical probability distribution functions
φ_1(x) and φ_2(y) be defined as in (5) and (6) respectively. Then
the Glivenko-Cantelli Lemma of order statistics states that the
mathematical expectation of the empirical probability
distributions would be given by the respective theoretical
probability distributions.




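\phi_1(x) = \begin{cases} 0, & x < x_{(1)} \\ (r-1)/n, & x_{(r-1)} \le x < x_{(r)},\ r = 2, 3, \ldots, n \\ 1, & x \ge x_{(n)} \end{cases}    (5)

\phi_2(y) = \begin{cases} 0, & y < y_{(1)} \\ (r-1)/n, & y_{(r-1)} \le y < y_{(r)},\ r = 2, 3, \ldots, n \\ 1, & y \ge y_{(n)} \end{cases}    (6)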



Now, let X_k be random in the interval [a, b] and Y_k be random in
the interval [b, c], so that P_1(a, x) and P_2(b, y) are the
probability distribution functions followed by X_k and Y_k
respectively. Then in this case the Glivenko-Cantelli Lemma gives

E[\phi_1(x)] = P_1(a, x), a \le x \le b, and
E[\phi_2(y)] = P_2(b, y), b \le y \le c    (7)



It can be observed that in equation (4) the membership values
of [x_{(r)}, x_{(r+1)})^{(r/n)}, r = 1, 2, ..., n-1 look like the empirical
probability distribution function φ_1(x) and the membership
values of (y_{(n-r)}, y_{(n-r+1)}]^{(r/n)}, r = 1, 2, ..., n-1 look like the
values of the empirical complementary probability distribution
function or empirical survival function [1 - φ_2(y)].

Therefore, if A(x) is the membership function of an L-R fuzzy
number A = [a, b, c], we get from (7)

A(x) = \begin{cases} P_1(a, x), & a \le x \le b \\ 1 - P_2(b, x), & b \le x \le c \end{cases}    (8)



Thus it can be seen that P_1(x) can indeed be the Dubois-Prade
left reference function and (1 - P_2(x)) can be the Dubois-Prade
right reference function [14]. Baruah [13] has shown that if a
possibility distribution is viewed in this way, two probability
laws can, indeed, give rise to a possibility law.

IV. Algorithm proposed 

If the time-stamps stored in the transactions of temporal data
follow a time hierarchy of the type hour_day_month_year, we do
not consider the year in the time hierarchy and only consider
day_month. We extract frequent itemsets using the method
discussed in [3]. Each frequent itemset will have a sequence of
time intervals of the type [day_month, day_month] associated
with it where it is frequent. Using this sequence of time
intervals we can find the set of superimposed intervals [the
definition of superimposed intervals is given in section-3], and
each superimposed interval will be a fuzzy interval. The
method is as follows. For a frequent itemset the set of
superimposed intervals is initially empty. The algorithm visits
each interval associated with the frequent itemset sequentially.
If an interval intersects the core of any existing superimposed
interval in the set [the definition of core is given in section-3],
it is superimposed on it and the membership values are
adjusted; otherwise a new superimposed interval is started with
this interval. This process continues till the end of the sequence
of time intervals, and is repeated for all the frequent itemsets.
Finally each frequent itemset will have one or more
superimposed time intervals. As the superimposed time
intervals are used to generate fuzzy intervals, each frequent
itemset will be associated with one or more fuzzy time intervals
where it is frequent. Each superimposed interval is represented
in the compact manner discussed in section-3.

For representing each superimposed interval of the form

[t_{(1)}, t_{(2)})^{(1/n)} [t_{(2)}, t_{(3)})^{(2/n)} [t_{(3)}, t_{(4)})^{(3/n)} ... [t_{(r)}, t_{(r+1)})^{(r/n)} ...
[t_{(n)}, t'_{(1)}]^{(1)} (t'_{(1)}, t'_{(2)}]^{((n-1)/n)} ... (t'_{(n-2)}, t'_{(n-1)}]^{(2/n)} (t'_{(n-1)}, t'_{(n)}]^{(1/n)}

we keep two arrays of real numbers, one for storing the values
t_{(1)}, t_{(2)}, t_{(3)}, ..., t_{(n)} and the other for storing the values
t'_{(1)}, t'_{(2)}, ..., t'_{(n)}, each of which is a sorted array. Now if a
new interval [t, t'] is to be superimposed on this interval we add t
to the first array by finding its position (using binary search) in
the first array so that it remains sorted. Similarly t' is added to
the second array.

The data structure used for representing a superimposed interval is

struct superinterval
{   int arsize, count;
    short *l, *r;
};

Here arsize represents the maximum size of the arrays used,
count represents the number of intervals superimposed, and l
and r are two pointers pointing to the two associated arrays.
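
The two sorted arrays can be sketched in Python as follows. This is only an illustration under stated assumptions: the class name `SuperInterval`, its method names and the use of `bisect` are hypothetical and are not taken from the original C-style structure. The membership of a point is computed as the number of superimposed intervals that cover it, divided by the number of intervals superimposed.

```python
from bisect import bisect_left, bisect_right, insort
from fractions import Fraction


class SuperInterval:
    """Superimposed interval kept as two sorted lists of end points."""

    def __init__(self, start, end):
        self.lefts = [start]   # sorted left end points t_(1) <= ... <= t_(n)
        self.rights = [end]    # sorted right end points t'_(1) <= ... <= t'_(n)

    @property
    def count(self):
        """Number of intervals superimposed so far."""
        return len(self.lefts)

    def core(self):
        """The part with membership 1: [t_(n), t'_(1)]."""
        return self.lefts[-1], self.rights[0]

    def superimpose(self, start, end):
        """Insert both boundaries, keeping each list sorted (binary search)."""
        insort(self.lefts, start)
        insort(self.rights, end)

    def membership(self, x):
        """Fraction of the superimposed intervals that contain the point x."""
        covering = bisect_right(self.lefts, x) - bisect_left(self.rights, x)
        return Fraction(covering, self.count)


s = SuperInterval(1, 5)
s.superimpose(3, 7)
print(s.core(), s.membership(2), s.membership(4))
```

Python lists grow on demand, so the sketch needs no counterpart of the arsize field.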

A. Algorithm 

for each locally frequent itemset s do
{   L <- sequence of time intervals associated with s
    Ls <- set of superimposed intervals, initially set to null
    lt = L.get();
    // lt is now pointing to the first interval in L
    Ls.append(lt);
    while ((lt = L.get()) != null)
    {   flag = 0;
        while ((lst = Ls.get()) != null)
            if (compsuperimp(lt, lst))
                flag = 1;
        if (flag == 0) Ls.append(lt);
    }
}

compsuperimp(lt, lst)
{   // intersect() takes the intersection of lt with the core of lst
    if (intersect(lst, lt) != null)
    {   superimp(lt, lst);
        return 1;
    }
    return 0;
}



The function compsuperimp(lt, lst) first computes the
intersection of lt with the core of lst. If the intersection is
non-empty it superimposes lt by calling the function superimp(lt,
lst), which actually carries out the superimposition process by
updating the two lists associated with lst as described earlier. The
function returns 1 if lt has been superimposed on lst and
returns 0 otherwise. get and append are functions operating on
lists to get a pointer to the next element in a list and to append
an element to a list.
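
The loop above can be sketched in Python as follows. The sketch assumes that the day_month boundaries have already been mapped to comparable numbers (for example day-of-year integers), which is an assumption made only for illustration. The function names are hypothetical. As in the pseudocode, the scan over the existing superimposed intervals does not stop at the first match.

```python
from bisect import insort


def core(sup):
    """Core [t_(n), t'_(1)] of a superimposed interval (lefts, rights)."""
    lefts, rights = sup
    return lefts[-1], rights[0]


def comp_superimp(interval, sup):
    """Superimpose `interval` on `sup` if it intersects the core of `sup`."""
    start, end = interval
    core_start, core_end = core(sup)
    if max(start, core_start) <= min(end, core_end):  # non-empty intersection
        insort(sup[0], start)   # keep the left end points sorted
        insort(sup[1], end)     # keep the right end points sorted
        return True
    return False


def build_superimposed(intervals):
    """Visit the intervals of one frequent itemset sequentially."""
    superimposed = []
    for interval in intervals:
        placed = False
        for sup in superimposed:
            if comp_superimp(interval, sup):
                placed = True
        if not placed:
            # Start a new superimposed interval with this interval.
            superimposed.append(([interval[0]], [interval[1]]))
    return superimposed


print(build_superimposed([(1, 5), (3, 7), (10, 12), (4, 6)]))
```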

B. Estimate of the work done 

Let n be the size of the sequence of time intervals associated with
a frequent itemset. Let m be the average number of intervals
superimposed in one place. For each time interval of an
itemset a pass is made through the list of superimposed
intervals to check whether it can be superimposed on any of
the existing superimposed intervals. Here each
superimposed interval can generate a fuzzy interval as shown in
section-3. For this check the intersection of the core of the
superimposed time interval and the current time interval is
computed, which requires O(1) time. If the interval
superimposes, then its time boundaries are inserted in the
two sorted arrays maintained for the superimposed interval.
Searching in a sorted array of size m requires O(log m) time, and
inserting at the correct place requires O(m) time. Two such
insertions take O(2m), i.e. O(m), time. Thus for one itemset
this process requires O(npm) time, where p is the size of the
set of superimposed intervals. Now p = O(n) and m = O(n),
thus the overall complexity in the worst case is O(n^3). This
has to be done for each frequent itemset.

V RESULTS OBTAINED 
For experimental purposes, we have used the synthetic dataset
T10I4D100K, available from the FIMI website. A summarized
view of the dataset describing the number of items, the number
of transactions, and the minimum, maximum and average
length of transactions is presented in table 1. Since the dataset
is non-temporal it cannot be used in its current form for our
experimentation. Some of the results obtained are presented in
table 2.

TABLE 1. T10I4D100K DATASET CHARACTERISTICS

Dataset       #Items   #Transactions   Min|T|   Max|T|   Avg|T|
T10I4D100K    942      100 000         4        77       39

FIMI repository: http://fimi.cs.helsinki.fi/data/

TABLE 2. YEARLY FUZZY FREQUENT ITEMSETS FOR DIFFERENT SET OF
TRANSACTIONS FOR ITEMSET {1}

Data Size (No of Transactions)   No. fuzzy time intervals
10000                            1
20000                            2
30000                            2
40000                            3
50000                            3
60000                            4
70000                            4
80000                            4
90000                            4
100000                           4



Here we keep the life-span of our datasets as 5 years.
First, we take only 10,000 transactions and find that the
itemset {1} has intervals superimposed in one place, and hence
it has one fuzzy time interval where it is frequent. For 20,000
and 30,000 transactions the same itemset has two superimposed
intervals and so two fuzzy time intervals, and for 40,000 and
50,000 transactions it has three (table 2). Finally, from 60,000
to 100,000 transactions, {1} is frequent in four fuzzy time
intervals.

VI CONCLUSIONS AND LINES FOR FUTURE WORK 

An algorithm for finding yearly fuzzy patterns is discussed in
this paper. The method takes as input a list of time intervals
associated with a frequent itemset. The frequent itemsets are
generated using a method similar to the method discussed in [4].
However, in our work we do not consider the year in the time
hierarchy and only consider month and day. So each frequent
itemset is associated with a sequence of time intervals of
the form [day_month, day_month] where it is frequent. The
algorithm visits each interval in the sequence one by one and
stores the intervals in the superimposed form. This way each
frequent itemset is associated with one or more superimposed
time intervals. Each superimposed interval generates a
fuzzy time interval, so each frequent itemset is associated
with one or more fuzzy time intervals. An example of such a
yearly fuzzy pattern is that cold drinks are frequent in
every summer. The nicety of the method is that the
algorithm is less user-dependent, i.e. the fuzzy time intervals are
extracted by the algorithm automatically.

Future work may be possible in the following ways. 

• Other types of fuzzy patterns, namely monthly and
daily patterns, can be extracted.

• Clustering of patterns can be done based on the
fuzzy time intervals associated with yearly patterns
using some statistical measure.

Abstract— Researchers have shown that practical mobile
communication channels introduce errors that are
concentrated in a certain locality rather than random
errors. These are burst errors caused by deep fading of
the wireless channel or a lightning strike. The existing
Viterbi Algorithm (VA), capable of correcting random
errors, is inefficient in correcting burst errors and
therefore leaves an unacceptable number of residual
errors. This paper presents an assessment of the
Non-Transmittable Codewords (NTCs) enhancement
technique to VA in decoding received signals
subjected to burst errors that may occur in poor channels.
A hard decision, 1/2 rate, constraint length K = 3
Viterbi Algorithm decoding technique, Binary
Phase-Shift Keying (BPSK) and Additive White Gaussian
Noise (AWGN) are the components used in the MATLAB
software based simulation when assessing the proposed
technique. Applying 6 NTCs to the VA decoder enables the
decoder to reduce its residual errors by 83.7 percent.
However, the technique reduces the encoder's data
transmission rate from 1/2 to 1/6.

Keywords - Locked Convolutional encoder; Burst errors;
Residual errors; Non-Transmittable Codewords (NTCs); Viterbi
Algorithm Decoding; Data Interleaver

I. Introduction 

A binary convolutional encoder paired with a Viterbi decoder
is one of the most widely used components in digital
communication to achieve low error rate data transmission.
Convolutional codes are popular Forward Error Correction
(FEC) codes in use today. This type of code was first
introduced by Elias in 1955 [1], [2]. VA, introduced in 1967
[3], is known to be the maximum likelihood decoding
algorithm for convolutional codewords transmitted over an
unreliable channel [4]. VA is efficient in correcting random
errors that occur in a channel. However, the occurrence of
burst errors in a received data block results in uncorrected or
residual errors. Practical mobile communication channels are
sometimes affected by errors which are concentrated in a
certain locality rather than random errors [5]. These burst
errors occur due to deep fading of the wireless channel, a
strike of lightning in case of poor weather conditions, or
intensive interference with other radio communication
systems in the environment [6].

A VA decoder can increase its error correction capability by
increasing its constraint length (its memory size) [7].
However, increasing the memory size beyond 10 leads the
decoder into prohibitive delay (not tolerated by most real time
applications) due to exponential growth of its decoding
computation complexity [8], [2]. For decades, VA has
dealt with burst errors with the support of an interleaving utility.
Without an interleaver, burst errors drive a Viterbi decoder's
decision unit into a confusion state, which leads the decoder
into failure and thus results in residual errors [9].

The basic idea behind the application of interleaved codes
is to shuffle the data so that burst errors that are closely
located in the received stream are randomized before the VA
decoder is applied. Thus, the main function of the interleaver
at the transmitter is to change the input symbol sequence. At
the receiver, the de-interleaver changes the received sequence
to get back the original sequence as it was at the transmitter.
There are two main categories of interleaver utilities in
communication systems: block and convolutional interleavers.

A block interleaver simply writes the incoming data row by row
into a matrix and reads them out for transmission column by
column. Fig. 1 demonstrates how the block interleaver and
de-interleaver work to jumble the data and disperse
burst errors. If a sufficient interleaver depth (number of rows
in the interleaver/de-interleaver) is applied, then the interleaver
successfully removes the effects of burst errors and turns
them into a pattern of random errors that the VA decoder
can handle.
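
A minimal Python sketch of the row-write, column-read idea is given below. The function names and the `depth` and `width` parameters are illustrative assumptions, and the sketch assumes the block length equals depth times width.

```python
def interleave(symbols, depth, width):
    """Write `symbols` row by row into a depth x width matrix, read by column."""
    assert len(symbols) == depth * width
    rows = [symbols[r * width:(r + 1) * width] for r in range(depth)]
    return [rows[r][c] for c in range(width) for r in range(depth)]


def deinterleave(symbols, depth, width):
    """Write `symbols` column by column, read row by row (inverse operation)."""
    assert len(symbols) == depth * width
    cols = [symbols[c * depth:(c + 1) * depth] for c in range(width)]
    return [cols[c][r] for r in range(depth) for c in range(width)]


data = list(range(12))                 # symbol positions 0..11
sent = interleave(data, depth=3, width=4)
burst = sent[3:6]                      # three consecutive symbols hit by a burst
print(sent)
print(sorted(burst))                   # original positions touched by the burst
assert deinterleave(sent, depth=3, width=4) == data
```

In this small case a burst covering three consecutive transmitted symbols touches original positions that are a full row apart, so after de-interleaving the decoder sees isolated errors.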



[Interleaver: data input written row by row, data output read column by column. De-interleaver: data input written column by column, data output read row by row.]

Fig. 1. Block Interleaver-De-interleaver

As is clear from Fig. 1, the columns are read sequentially
from the interleaver. The receiver can only interpret a
complete row when all the involved columns in the
interleaver depth have arrived and not before that. In addition,
the receiver requires a considerable amount of memory in order
to store the received symbols until all the involved rows in
the interleaver depth have arrived. These facts raise two basic
drawbacks of the technique: latency and storage (a large
amount of memory). These drawbacks are of great concern and
a challenge to delay sensitive real time applications [10], [11].

The recently introduced convolutional interleaver [12] is
reported to have reduced a large part of these
drawbacks. However, it is important to note that the
application of an interleaver is necessary only when the code
involved fails to deliver the required quality of error correction.

This paper assesses a technique of using Non-Transmittable
Codewords (NTCs) [9] to support the VA decoder
in decoding burst errors. The rest of this paper is organized in
the following manner: Section II briefly discusses the
encoding and decoding process using a 1/2 rate, constraint
length K=3 binary convolutional encoder and the Viterbi
decoder. Section III discusses the assessed technique, which
enhances the VA decoder in reducing the number of residual
errors when it receives burst errors for decoding. Section IV
describes the model used in building the assessment simulation
in MATLAB software. Section V discusses the results obtained
from the simulation and section VI concludes the paper.

II. Encoding and Decoding Processes 

Binary convolutional coding is a technique that adds
binary redundancy bits to the original bit sequence to
increase the reliability of data communication. This part of
the paper discusses a simple binary convolutional
coding scheme at the transmitter and its associated VA
(maximum likelihood) decoding scheme at the receiver.

A. Encoding 

There are various binary convolutional coding schemes
having a design data rate of 1/2. Fig. 2 shows the
architecture of a binary convolutional encoder with design
data rate 1/2, constraint length K=3 and generator polynomial
[7, 5]_8, which is equivalent to [111, 101]_2.




This paper considers the simplest encoder, which has the
following features:

• Rate: Ratio of the number of input bits to the number of
output bits. In this encoder, the rate is 1/2, which means
there are two output bits for each input bit.

• Constraint length: The number of bits that take part in
forming each output, i.e. the delay elements plus the current
input bit. In this encoder, K=3: there are two delay elements
(in memory) plus a single input bit.

• Generator polynomials: These refer to the wiring of the
input sequence with the delay elements to form the
output. In this example, the generator polynomials are [111,
101]_2. The output from the [111]_2 arm is the exclusive
OR (XOR) of the current input, the previous input and the
input before the previous one. The output from the
[101]_2 arm is the XOR of the current input and the
input before the previous one only.

Table 1 demonstrates the encoding process by showing the
relation between the input and output bits and their
corresponding encoder state transitions. The binary
convolutional encoding process usually starts with the encoder
in the all zero state (i.e. 00). Suppose we want to encode the
stream of bits {1-1-0-1...}. When we pick the first input bit
(i.e. 1) with the initial encoder state all zeros (i.e. 00), table 1
leads us to line S/N 2, where the input data is one (i.e. 1) and
the current state is all zeros (i.e. 00). This line indicates that
the next state is one-zero (i.e. 10) and the encoder's output is
one-one (i.e. 11). Following the same procedure one can see
that encoding {1-1-0-1...} results in {11-01-01-00...} as output
codewords while the encoder goes through the state transitions
(00-10-11-01-10...).
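
The encoding walk-through can be checked with a short Python sketch of the rate 1/2, K=3, [111, 101]_2 encoder. The function name `conv_encode` and the string representation of codewords and states are assumptions made for illustration.

```python
def conv_encode(bits):
    """Encode `bits` with the rate 1/2, K=3, generators [111, 101] encoder.

    Returns the list of two-bit codewords and the list of visited states,
    starting from the all zero state "00".
    """
    s1 = s2 = 0                      # s1: previous input, s2: input before that
    codewords, states = [], ["00"]
    for u in bits:
        first = u ^ s1 ^ s2          # generator 111
        second = u ^ s2              # generator 101
        codewords.append(f"{first}{second}")
        s1, s2 = u, s1               # shift the register
        states.append(f"{s1}{s2}")
    return codewords, states


codewords, states = conv_encode([1, 1, 0, 1])
print("-".join(codewords))
print("-".join(states))
```

For the input {1-1-0-1} the sketch reproduces the codewords 11-01-01-00 and the state sequence 00-10-11-01-10, in agreement with the rows of the table.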

Table I. Input, Output and State transition relations

*S/N   Input Data   Current State   Next State   Output Data
1      0            00              00           00
2      1            00              10           11
3      0            10              01           10
4      1            10              11           01
5      0            01              00           11
6      1            01              10           00
7      0            11              01           01
8      1            11              11           10

*S/N: means a row serial number



Fig. 2. Binary Convolutional code with Rate 1/2, K=3, Generator
Polynomial [111, 101]_2

The design data rate, constraint length and generator matrix
specify the convolutional encoder.



B. Decoding 

VA decoding is a dynamic programming algorithm for
finding the most likely path from a number of given paths. The
decoding process starts from the all zeros (i.e. 00) state and
opens out to other states as time goes on. State and
trellis diagrams describe the internal operations of the Viterbi
decoder. Fig. 3 is a state diagram of the rate 1/2,
constraint length K=3 Viterbi decoder showing the allowed
state transitions during the decoding process.

To make the explanation easy, hard-decision symbol
inputs are applied and thus the Hamming Distance (HD) metric
weighs the path branches. HD is a bitwise comparison
between the codeword received from the channel and
each codeword allowed by the decoder at that particular time
interval.




Node content corresponds to the shift register contents:
S_0 = 00, S_1 = 01, S_2 = 10 and S_3 = 11.

Branch path label:
vv/u - input/output (Codeword/Data)

Fig. 3 Allowed state transition diagram of the 1/2 decoding rate and constraint
length K=3 Viterbi Algorithm decoder

There are two ways of calculating HD [8]: counting the bits in
which the codewords are similar or counting the bits in which
they differ. In this paper the bit similarity method is used.
Therefore, each similar bit is granted a value of one (i.e. 1 HD)
and each non-similar bit a value of zero (i.e. 0 HD). In this
case, the result of each pairwise comparison can be zero, one or
two HDs. Fig. 4 shows the calculated HD of each branch in each
time interval, with the results put in round brackets (i.e. (x)).
After obtaining the HDs the algorithm continues as follows:

• Use the relation in (1) to recursively calculate the
Cumulative Branch Metric (CBM) in each time interval
t by adding the obtained HD_{(t)} to the CBM_{(t-1)} of each
path, and put the results in square brackets (i.e. [x]). Note
that for the time interval t=1 there is no CBM_{(t-1)}, thus
its value is zero.

CBM_{(t)} = CBM_{(t-1)} + HD_{(t)}    (1)

• At each node, find the path having the highest CBM up
to time t by comparing the CBMs of all paths converging to
that node. In this step, the decisions are used to recursively
update the survivor path of that node. Equation (2) shows
how the survivor path is obtained.

PM_{(t)} = \max(CBM^{1}_{(t)}, CBM^{2}_{(t)})    (2)

• Eventually, when the decoder terminates at time interval
t, the survivor paths leading to each node are compared to
obtain a Winning Survivor Path (WSP). Equation (3)
shows this relation. If more than one node has the same
highest WPM then one of them is randomly selected and
data are extracted from it.

WPM_{(t)} = \max(PM^{1}_{(t)}, PM^{2}_{(t)}, PM^{3}_{(t)}, PM^{4}_{(t)})    (3)

Figure 4 is a trellis diagram of the 1/2 decoding rate and
constraint length K=3 Viterbi decoder demonstrating the
discussed steps. The codewords obtained in subsection
A of this part {i.e. 11-01-01-00...} are assumed to have been
received with a bit error in the second bit of the third
codeword {i.e. 11-01-00-00...}.
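
The decoding steps can be sketched in Python for this received sequence. The sketch uses the bit similarity metric of this paper (higher is better) and keeps, for every state, the surviving path with the highest cumulative metric. The function name `viterbi_decode` is an assumption, and ties are resolved by keeping the first path found rather than by the random selection described above.

```python
def viterbi_decode(codewords):
    """Hard-decision Viterbi decoding for the rate 1/2, K=3, [111, 101] code.

    `codewords` is a list of two-character strings such as "11".
    The branch metric counts matching bits, so the best path has the
    highest cumulative metric.
    """
    def step(state, u):
        s1, s2 = state
        return (u, s1), (u ^ s1 ^ s2, u ^ s2)   # next state, allowed codeword

    # survivors[state] = (cumulative metric, decoded bits); start in state 00
    survivors = {(0, 0): (0, [])}
    for cw in codewords:
        rx = (int(cw[0]), int(cw[1]))
        candidates = {}
        for state, (metric, bits) in survivors.items():
            for u in (0, 1):
                nxt, out = step(state, u)
                hd = (out[0] == rx[0]) + (out[1] == rx[1])   # similar bits
                cbm = metric + hd                            # relation (1)
                if nxt not in candidates or cbm > candidates[nxt][0]:
                    candidates[nxt] = (cbm, bits + [u])      # relation (2)
        survivors = candidates
    # relation (3): pick the survivor with the highest metric over all states
    return max(survivors.values(), key=lambda item: item[0])[1]


print(viterbi_decode(["11", "01", "00", "00"]))
```

With the single bit error in the third codeword the winning path has a cumulative metric of 7 out of 8 and decodes to the original data bits 1-1-0-1.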




[Trellis diagram over time intervals t = 0, 1, 2, 3, 4]

Key
Branch path labels:
xx/y - Allowed codeword/Output
(x) - Branch path Hamming metric
[x] - Cumulative path Hamming metric
Branch path types:
- Paths with lower Hamming metrics (non-surviving paths are discarded)
- Paths with higher Hamming metrics (surviving paths are kept)
- Winning surviving path

Fig. 4 Trellis diagram of the 1/2 decoding rate and constraint length K=3 Viterbi
Algorithm decoder

III. Non Transmittable Codewords Enhancement 

The Non-Transmittable Codewords (NTCs) technique is
applied at the data receiving machine, where data encoded by
a locked convolutional encoder arrive to be decoded by a
VA decoder [9]. There are two different ways of locking a
1/2 rate, constraint length K=3 convolutional encoder:
either adding two zero bits (i.e. 00) to the encoder input after
every data bit to be encoded (the lower end locked encoder)
or adding two one bits (i.e. 11) after each data input bit (the
higher end locked encoder) [9]. All examples and simulations
in this paper apply a lower end locked 1/2 rate, constraint
length K=3 convolutional encoder. Lock bits reset the
encoder to the all zeros state after every data input bit. Suppose
we have {1, 1...} as a binary data stream ready for the
encoding process. Fig. 5 shows the encoder locking process,
where the letter "D" stands for a data bit and "L" stands
for an integrated lock bit.

 D D                D L L D L L
{1-1...}    -->    {1-0-0-1-0-0...}

Fig. 5. 1/2 rate and K=3 Convolutional encoder locking Process

After the encoding process, all codewords corresponding
to both data and lock bits are transmitted over a noisy channel
to the receiving machine. This lowers the data
transmission rate of the encoder from 1/2 to 1/6, since each
data bit is now carried by three codewords (six bits). NTCs are
known all-zero codewords (for the lower end locked
convolutional encoder) that are added to the codewords received
from the channel to help the VA decoder correct the received
errors successfully. NTCs can be added as one, two, three
codewords and so on to each codeword corresponding to a
data bit. Suppose the example in fig. 5 {i.e. 1-0-0-1-0-0...}
is encoded using the procedure discussed in table 1; the
codeword stream {11-10-11-11-10-11...} is then
obtained for transmission. Fig. 6 demonstrates how 2 NTCs
are integrated into the received codewords corresponding to
both data and lock bits before the received codewords are
submitted for decoding. After the decoding process, all bits
corresponding to the received lock codewords and the added
NTCs are removed and the remaining data are submitted for
use. In fig. 6 the letter "D" indicates a codeword corresponding
to a data bit, "L" a codeword corresponding to a lock bit
and "N" an added NTC.

 D  L  L  D  L  L
{11-10-11-11-10-11...}  --locked+2NTCs-->  {00-00-11-10-11-00-00-11-10-11...}

Fig. 6. 1/2 rate and K=3 Convolutional encoder locking and 2NTCs addition process
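
The locking and NTC steps can be sketched in Python as below. All function names are hypothetical. The sketch assumes the lower end locked encoder (lock bits 00) and places the all-zero NTCs in front of each data codeword, following the codeword stream shown for the 2 NTCs example.

```python
def add_lock_bits(data_bits):
    """Append two zero lock bits after every data bit (lower end locking)."""
    locked = []
    for bit in data_bits:
        locked.extend([bit, 0, 0])
    return locked


def conv_encode(bits):
    """Rate 1/2, K=3, generators [111, 101] encoder starting in state 00."""
    s1 = s2 = 0
    codewords = []
    for u in bits:
        codewords.append(f"{u ^ s1 ^ s2}{u ^ s2}")
        s1, s2 = u, s1
    return codewords


def add_ntcs(received_codewords, k):
    """At the receiver, put k all-zero NTCs before each data/lock/lock group."""
    out = []
    for i in range(0, len(received_codewords), 3):
        out.extend(["00"] * k)
        out.extend(received_codewords[i:i + 3])
    return out


def strip_lock_and_ntc_bits(decoded_bits, k):
    """Keep only the data bit of every decoded group of k + 3 bits."""
    return [decoded_bits[i + k] for i in range(0, len(decoded_bits), k + 3)]


locked = add_lock_bits([1, 1])
sent = conv_encode(locked)
print(locked)
print(sent)
print(add_ntcs(sent, 2))
```

Each data bit costs three input bits and therefore six transmitted bits, which is where the rate of 1/6 comes from; the NTCs are added at the receiver and are not transmitted.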

IV. Model Description and Simulation 

This work evaluates the error correction capability of the VA
decoder and the Enhanced VA (EVA) decoder supported by
NTCs. Both the VA and EVA decoders decode codewords from
the same 1/2 rate, constraint length K=3 binary
convolutional encoder. However, the encoder is locked using
the discussed technique for the EVA decoder to enable it to
utilize the technique described in section III of this paper.
The number of residual errors from each decoder forms the
performance comparison factor between the two decoders,
so the decoder with fewer residual errors is identified.

The performance measure of the error correcting capability of
the implemented codes is also given by the Bit Error Rate (BER),
which is obtained as the number of erroneous bits divided by
the total number of transmitted bits. BER is affected by
several factors including the quantization technique used, noise
in the channel, energy per symbol to noise ratio (E_s/N_0), code
rate, and transmitter power level [13]. In their work [14],
Akyildiz and colleagues showed that BER is directly
proportional to the code rate and inversely proportional to the
energy per symbol to noise ratio and the transmitter power level.
The use of a proper decoder that corrects errors controls the
increase in BER in transmitted data. The difference in BER
that can be achieved by using error correction codes relative to
uncoded transmission is known as coding gain.

A MATLAB software simulation that follows the
procedure shown in the block diagram in fig. 7
performs the following:

• Generating random binary data (i.e. 0 and 1);

• Adding and removing encoder lock bits and
NTCs for the case of EVA;

• Encoding binary data using the rate 1/2, generator
polynomial [7,5]_8 convolutional code;

• Passing codewords through a noisy channel;

• Modulating and demodulating the codeword signals
using the hard decision technique for the decoding process;

• Passing the received coded signals to the Viterbi decoder
and the Enhanced Viterbi decoder;

• Counting the number of residual errors at the
output of the Viterbi decoder and the Enhanced Viterbi
decoder; and

• Repeating the same for multiple Signal-to-Noise
Ratio (SNR) values.

All the comparisons assume that both algorithms have
the same execution time. Table II lists all the
parameters chosen for the simulation.

Table II. Simulation Parameters

Parameter                  Value
Data length                10^6
Constraint Length (K)      3
Generator polynomial       (7,5)_8
Rate (r)                   1/2
Encoder lock bits          2 zero bits (i.e. 00)
NTCs                       1,2,3,4,5,6,7,8,9,10,11,12
Modulation/Demodulation    BPSK
Noise model                AWGN
Quantization               Hard Decision
Path evaluation            Hamming Distance Metric



A. Implementation of Codes 

Figure 7 illustrates the encoding and decoding procedure
in the simulated communication system. Randomly generated
binary data from the binary data source are sent directly to the
binary convolutional encoder, while data that will be decoded
by EVA are sent through the lock bit addition node before
they are submitted to the encoder. Codewords from the
encoder are submitted to the discrete channel.



[Block diagram blocks: Binary Data Source; Lock Bits Addition; Channel Encoder (Convolutional Encoder); Channel Decoder (Viterbi Decoder); Lock Bits & NTCs Removal; Binary Data Sink]

Fig. 7 Binary communication system block diagram used in simulation

The discrete channel modulates and demodulates the sent
signal using Binary Phase-Shift Keying (BPSK), where a zero
bit (i.e. 0) is mapped to (-1) and a one bit (i.e. 1) is mapped to
(+1), and back. After modulation the signals are released
to the Additive White Gaussian Noise (AWGN) channel.
Adding AWGN to signals in the transmission channel
involves generating Gaussian random numbers, scaling the
numbers according to the desired energy per symbol to noise
density ratio (E_s/N_0), and adding the scaled Gaussian random
numbers to the channel symbol values.
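
A minimal Python sketch of this channel model is shown below. It is an illustration, not the MATLAB code used in the simulation. The function name and the `rng` parameter are assumptions, and the noise scaling assumes unit-energy BPSK symbols, for which the noise standard deviation is sqrt(1 / (2 E_s/N_0)) with E_s/N_0 as a linear ratio.

```python
import math
import random


def bpsk_awgn_hard(bits, es_n0_db, rng):
    """Send bits over BPSK + AWGN and return hard-decision received bits.

    Assumes unit symbol energy, so the noise standard deviation is
    sqrt(1 / (2 * Es/N0)) with Es/N0 taken as a linear ratio.
    """
    es_n0 = 10 ** (es_n0_db / 10)            # dB -> linear ratio
    sigma = math.sqrt(1 / (2 * es_n0))
    received_bits = []
    for bit in bits:
        symbol = 1.0 if bit else -1.0        # 0 -> -1, 1 -> +1
        noisy = symbol + rng.gauss(0, sigma) # add scaled Gaussian noise
        received_bits.append(1 if noisy >= 0 else 0)   # hard decision
    return received_bits


rng = random.Random(0)
tx = [rng.randint(0, 1) for _ in range(10000)]
rx = bpsk_awgn_hard(tx, es_n0_db=2.0, rng=rng)
print(sum(a != b for a, b in zip(tx, rx)) / len(tx))
```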

Received codewords from the discrete channel are either
sent to the VA decoder directly or routed through the NTCs node
for adding NTCs. After decoding, both the lock bits and the bits
corresponding to the added NTCs are removed from the data
stream. Eventually, the data obtained from the two streams (VA
and EVA) are compared with the original generated data to
determine the number of residual errors at each SNR value.

B. Performance Measure 

The theoretical uncoded BER given by the relation in (4) is used
as the reference for the code gain [7]. Here E_b/N_0 is expressed
as a ratio (not in dB) and "erfc" is the complementary
error function in MATLAB software. For the uncoded channel,
E_s/N_0 = E_b/N_0, since there is one channel symbol per bit.

BER = 0.5 * erfc(\sqrt{E_b/N_0})    (4)

However, the coded channel uses the relation (5) in the
simulation, with both sides in dB.

E_s/N_0 = E_b/N_0 - 10 \log_{10}(2)    (5)
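
Relations (4) and (5) can be evaluated with the Python standard library as a quick check. The function names are illustrative assumptions, and `math.erfc` plays the role of the MATLAB erfc function.

```python
import math


def uncoded_ber(eb_n0_db):
    """Theoretical uncoded BPSK BER, relation (4), for Eb/N0 given in dB."""
    eb_n0 = 10 ** (eb_n0_db / 10)            # dB -> linear ratio
    return 0.5 * math.erfc(math.sqrt(eb_n0))


def coded_es_n0_db(eb_n0_db):
    """Es/N0 in dB for the rate 1/2 coded channel, relation (5)."""
    return eb_n0_db - 10 * math.log10(2)


for snr_db in (0, 4, 8):
    print(snr_db, uncoded_ber(snr_db), coded_es_n0_db(snr_db))
```

The subtraction of 10 log10(2), about 3 dB, reflects that each information bit is spread over two channel symbols at rate 1/2.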



V. Results and Discussion 

This section presents a performance comparison between the
VA and EVA decoders based on their error correction
capability in terms of BER and residual errors. Fig. 8
compares the BER of the uncoded channel, VA and 6NTCs-EVA.
Table III presents the counted residual errors from both
VA and 6NTCs-EVA and the improvement obtained at each
SNR value.

A. Code gain 

It is clear from fig. 8 that the 6NTCs-EVA decoder has the
lowest BER curve, with an almost constant code gain of about
2 dB over nearly all SNR values. The VA decoder has the
highest BER (above the theoretical uncoded curve) below 4 dB,
and its BER is persistently higher than that of 6NTCs-EVA. The
minimal VA code gain is minus two (-2) dB. It is important to
note that, around and below 4 dB, VA has negative code gains
because VA faces difficulty in decoding burst errors in this
region. However, 6NTCs-EVA performs better than both the VA
and the theory-uncoded curves, with a difference of more than
2 dB. It can also be observed that the VA and 6NTCs-EVA
curves are far apart at lower SNR values (say below 4 dB)
and tend to come closer and closer as the SNR increases.
This is because more errors are generated at low SNR
values, which results in burst errors and creates a great
challenge to the VA decoder. As the SNR increases, fewer
and more random errors are generated and therefore the VA
decoder gains error correction power.

Fig. 8 BER performance for theory-uncoded, VA and EVA in BPSK and
AWGN



B. Residual Error 

Table III compares the residual errors obtained from the
simulation for VA and 6NTCs-EVA. The results show
that 83.7 percent of the total residual errors that occurred in VA
were corrected by applying 6 NTCs in the EVA. On average, 84.3
percent of the residual errors that occurred at and below 4 dB in
VA were successfully corrected by the 6NTCs-EVA decoder.

Table III. VA versus 6NTCs-EVA residual errors

Eb/No   VA residual   6NTCs-EVA         Data error recovery   Data error recovery
(dB)    errors        residual errors   improvement (bits)    improvement (%)
1       198187        34407             163780                82.6
2       129604        20417             109187                84.2
3       72308         10650             61658                 85.3
4       32492         4824              27668                 85.2
5       11581         1985              9596                  82.9
6       3094          571               2523                  81.5
7       614           127               487                   79.3
8       110           16                94                    85.5
9       7             2                 5                     71.4
10      0             0                 0                     0.0
Total   447997        72999             374998                83.7
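
The improvement columns of Table III follow directly from the two residual-error counts: the recovered bits are the VA residual errors minus the 6NTCs-EVA residual errors, and the percentage is that difference divided by the VA residual errors. The short sketch below recomputes them from the published counts; the variable and function names are illustrative only.

```python
# (Eb/No in dB, VA residual errors, 6NTCs-EVA residual errors), copied from Table III.
TABLE_III = [
    (1, 198187, 34407),
    (2, 129604, 20417),
    (3, 72308, 10650),
    (4, 32492, 4824),
    (5, 11581, 1985),
    (6, 3094, 571),
    (7, 614, 127),
    (8, 110, 16),
    (9, 7, 2),
    (10, 0, 0),
]


def recovery(va_errors: int, eva_errors: int) -> tuple:
    """Return (recovered bits, recovered percentage) for one row."""
    recovered = va_errors - eva_errors
    # With no VA errors there is nothing to recover; report 0.0 as the table does.
    percent = 100.0 * recovered / va_errors if va_errors else 0.0
    return recovered, percent


rows = [(snr,) + recovery(va, eva) for snr, va, eva in TABLE_III]

total_va = sum(va for _, va, _ in TABLE_III)
total_eva = sum(eva for _, _, eva in TABLE_III)
total_bits, total_percent = recovery(total_va, total_eva)

# Mean improvement over the rows at and below 4 dB.
low_snr = [pct for snr, _, pct in rows if snr <= 4]
low_snr_mean = sum(low_snr) / len(low_snr)

for snr, bits, pct in rows:
    print(f"{snr:>2} dB: {bits:>6} bits recovered ({pct:.1f}%)")
print(f"Total: {total_bits} bits recovered ({total_percent:.1f}%)")
print(f"Mean improvement at and below 4 dB: {low_snr_mean:.1f}%")
```

Rounded to one decimal place, the recomputed values agree with the table, including the 83.7 percent total and the 84.3 percent average at and below 4 dB quoted in the text.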



C. Impact of Various NTCs Values on EVA 

NTCs can be added as one, two, three codewords and so on.
Fig. 9 shows that increasing the number of NTCs added to EVA
progressively reduces the number of residual errors. It
further shows that there is no significant additional
reduction in residual errors once the number of NTCs exceeds
six. These results concur with the explanation given in [9].
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Fig. 9 Impact of NTCs to EVA 

VI. Conclusions & Recommendations 

This paper presented and assessed the NTCs-enhancement 
technique to Viterbi Algorithm at the receiving machine. The 
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significant improvement to VA decoders. The enhanced VA 
can be used in industries that demand error-free
transmission, such as telemedicine. However, the technique
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recommended to show the impact of the technique in 
different platforms and applications using Viterbi Algorithm. 
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Abstract — This paper explores the current existing models and 
technologies used in knowledge creation, knowledge sharing and
knowledge dissemination practices in Higher Learning
Institutions (HLIs) of Tanzania, and proposes a model for the
development of an Integrated Digital Academic Repository that
enhances the management, sharing and dissemination of scholarly
works produced in HLIs of Tanzania. The proposed model is
presented and described in the paper. The study was carried out
in three HLIs using questionnaires, interviews, observation and
a literature review. The findings show that universities produce
a wide range of intellectual outputs such as research articles,
learning materials, theses and technical reports. More than half
of the population involved in the study create and store their
intellectual outputs on personal computer hard drives, while
others store them on internet cloud servers and departmental web
servers. Moreover, intellectual outputs are shared and
disseminated through the internet (emails, social networks,
institution websites and cloud servers), journal publications,
seminar presentations, posters and printed copies in libraries.
The identified methods proved to be unreliable and to hinder the
availability and accessibility of scholarly works. The proposed
model therefore provides a central system through which
intellectual outputs will be collected, organized, archived and
disseminated. The paper concludes with the conceptual framework
of the proposed system; its design and development are left as
future work.

Keywords- Higher learning institution, intellectual output, 
knowledge management, knowledge sharing, model, digital 
repository 

I. Introduction 

In today's world, knowledge is considered a strategic
resource that shapes the knowledge-based economy of
countries. Knowledge has been identified as being as
important as other factors of production, such as land, labor
and capital, and it requires management for the development
of society [1, 2]. A knowledge-based economy is an economy in
which knowledge is created, acquired, transmitted and used
more effectively by individuals, enterprises, organizations
and communities to promote economic and social development
[3, 5].



Effective management, dissemination, sharing and use of
knowledge help to solve problems such as disease, poverty,
illiteracy, environmental degradation and deforestation,
especially in African countries, where half of the population
(50%) live in underprivileged societies, lack access to
information and suffer greatly when they fail to acquire and
use information in their lives [3, 4, 5]. According to [6],
every developed institution has a duty to place and
disseminate knowledge through centers that can easily be
accessed by underprivileged societies.

Higher Learning Institutions (HLIs) have been described
as the centers of creativity and innovation and the main
producers of knowledge, both scientific and technological,
that need a reliable, affordable and accessible medium to
manage and disseminate their scholarly work to the world.
Knowledge management is the practice of organizing, storing
and sharing vital information so that everyone can benefit
from its use. Among the practices in the knowledge management
process, knowledge sharing has been identified as the most
important, as it facilitates the dissemination and
application of the created knowledge [3]. As pointed out by
[7], knowledge has to be shared and disseminated to the
designated community for it to be used; otherwise it ends in
itself.

Researchers, students and faculty members in HLIs produce
a wide range of intellectual outputs such as research
articles, datasets, theses, dissertations, reports,
presentations and learning materials [8]. According to [5],
the materials created should be available and accessible to
scholars and the public in general. Scholars and the public
might apply the knowledge and use the disseminated
information as a base for their research; this leads to
knowledge evolvement and economic development [4]. However,
the materials produced are scattered and disorganized, and
there is no systematic integrated system to

Despite the rapid increase of digital materials produced
in HLIs [8], the availability and discoverability of scholarly
work remain a challenge. About 80%-85% of HLI outputs, such
as research articles and manuscripts, particularly from
African countries, have never been made accessible and
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discoverable to the scholarly community and the world [9> 
10]. Scholars archive their intellectual outputs on their
personal computer hard drives, on departmental web servers
and on library shelves, where access is granted to a limited
number of people or to none, and where materials are also
lost because most of them are not well organized and have no
clear documentation [8, 10]. Archaic dissemination techniques,
subscription fees and publication charges have been pointed
out among the factors that limit the number of intellectual
outputs and restrict access to scholarly works [4].

Unavailability of and limited access to scholarly works
present problems such as repetition of work done by other
scholars, limited knowledge evolvement and waste of national
resources, effort and money, and they negatively affect
countries' development [12].

To this end, this study explores the current situation of
the intellectual output sharing process, identifies the
challenges, and proposes a new model that enhances the
management, sharing and dissemination of scholarly works
created in HLIs of Tanzania. To achieve this, the study is
divided into the following specific objectives:

1) to identify the types of intellectual output produced in
Tanzanian HLIs;

2) to analyze how scholarly works created in Tanzanian HLIs
are collected, organized, archived, managed, shared and
communicated to the scholarly community and the world;

3) to identify the challenges associated with the current
archiving and dissemination techniques for scholarly works
used in Tanzanian HLIs;

4) to propose a new model that enhances knowledge sharing
and dissemination in HLIs;

5) to identify the design requirements of the proposed model.

II. Methodology 

The study was conducted in three HLIs: the Nelson Mandela
African Institution of Science and Technology (NM-AIST), the
Muhimbili University of Health and Allied Sciences (MUHAS) and
Sokoine University of Agriculture (SUA). The chosen
institutions are science universities offering undergraduate
and postgraduate studies; our study involved postgraduate
students, researchers and faculty members, since they produce
most of the intellectual output. This composition reflects the
importance of including the main stakeholders who produce and
manage intellectual outputs in HLIs. Questionnaires and
interview guide questions were used as data collection tools.
Questionnaires were administered to researchers, students and
faculty members, whereas library managers were approached for
face-to-face interviews to obtain detailed information and
experience on how their institutions manage and disseminate
scholarly works. The questionnaires were designed to capture
information on the types of intellectual output produced and
how they had been stored and disseminated to the public.
Respondents were also asked about the challenges associated
with the current archiving and dissemination techniques. A
detailed review of literature related to the topic was done to
become familiar with existing digital content sharing
techniques and to identify the challenges and weaknesses
associated with each method. Data were summarized and analysed
using the Statistical Package for the Social Sciences (SPSS)
and Excel. Pictorial presentations of the data were used to
compare and derive important patterns that are to be used for
further research.

III. Findings and Discussion 
A. Respondent profile 

The respondent profile describes each respondent's
designation and educational level, from which the authors
were able to judge whether a respondent is an appropriate
individual who can create, use, share and disseminate
intellectual output for other people to learn from. A total
of 95 questionnaires were administered to students,
researchers and faculty members of the studied institutions.
The population consisted of 67% students, 28% researchers and
5% faculty members, as shown in Fig 1.




Fig 1 Population composition 

Fig 2 shows the responses by level of education: 76% of
respondents were master's students, 17% were PhD candidates
or holders, and 7% were administrative staff.







Fig 2 Respondents' level of education 

We assumed the population is appropriate for the production
of intellectual outputs: students may produce theses,
dissertations and technical project reports, researchers come
up with research findings, and faculty members create learning
materials such as lecture notes and presentations. We observed
that in HLIs, particularly for postgraduate studies, students
must produce intellectual outputs such as research papers,
theses and dissertations as a criterion for graduation.
Institutions may take
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advantage of the regulation by introducing a mandatory policy 
requiring scholars to self-archive and disseminate their
research findings and learning materials through an
institutional repository, which can be used as a source of
scientific information for research, academic or economic
development. Moreover, researchers can use the archived data
as the base for next-generation research, as well as to
display and publicize the institution's research products to
the public. To achieve this goal, institutions need to
establish a stable, permanent, accessible and affordable
technology, namely a database, that will facilitate the
storage of a large volume of intellectual outputs and their
dissemination to a large number of scholars and the public.

B. Intellectual output creation and dissemination

The findings show that scholars in Tanzanian HLIs produce
different types of intellectual outputs, including learning
materials, research articles, manuscripts and technical
project materials. Fig 3 shows that 28% of respondents had
engaged in the production of learning materials such as
presentations and class notes, 23% had created research
articles, 21% technical reports and 21% manuscripts, while 7%
did not specify the type of intellectual output they had
created.




Fig 3 Intellectual output produced in HLI 

The findings revealed that abundant knowledge and skills
are being created by different scholars in HLIs, which would
be worth preserving and disseminating to the designated
community to be applied for both academic and economic
development. However, we usually come across research papers
published by institutions in different journals, with few or
no other types of intellectual outputs produced by the same
institutions. This observation suggests that institutions do
not consider other types of intellectual output as important
as research articles. However, [7] suggested that, for the
material created to be useful and have a positive impact, it
must be collected, organized, archived, shared and
disseminated to the designated community for use; otherwise,
the time, money and effort spent producing it are wasted.

Personal computer hard drives, the internet (emails and
cloud servers) and printed copies were mentioned as the
technologies used to archive and manage HLIs' intellectual
outputs. As shown in Fig 4, 51% of the respondents collect and
archive their intellectual outputs on personal computer (PC)
hard drives, 30% store them on the internet, particularly on
cloud services such as email, Google Drive and Dropbox, 17%
print and preserve hard copies of their works, and 3% were not
certain about the methods they use.







Fig 4 Intellectual output storage mechanisms 

Scholars in HLIs share and disseminate their research and
academic works via different channels: the internet (emails,
social networks, institution websites and cloud servers),
journal publications, seminar presentations, posters and
printed copies in libraries. The results in Fig 5 show that
35% of respondents use the internet as their content
dissemination mechanism, 30% publish their outputs in
journals, 24% present their outputs in seminars and workshops,
4% print and archive their copies, 4% publish their works
through posters and 4% were not certain about the media they
use. The findings revealed that the intellectual outputs
produced are stored and disseminated in widely scattered
locations. We observed that searching for and retrieving
widely scattered contents consumes more time and much more
bandwidth than retrieving the resources from a single,
well-organized source.

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Fig 5 Intellectual output dissemination mechanism 
Limited storage space, high publication cost, accessibility
cost, limited internet connectivity, physical security, and
access and sharing limitations were mentioned as challenges in
the process of managing and disseminating intellectual
outputs. As shown in Fig 6, 45% of respondents had experienced
limited storage space to archive their output, 30% mentioned
high publication and accessibility costs, 20% experienced
limited access and sharing, 35% claimed
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unreliability of their system (crash), and 10% identified 
physical security as their challenge. 




Fig 6 Sharing and dissemination challenges 

IV. The proposed Integrated Digital Academic 
Repository Model for HLIs 

The proposed model was developed based on the Open Archival
Information System (OAIS) Reference Model. The OAIS reference
model is a conceptual framework for a generic archival system
that is committed to the dual role of preserving information
and providing access to it. Central to the reference model is
an open archival information system (OAIS), which is "an
organization of people and systems that has accepted the
responsibility to preserve information and make it available
for a Designated Community" [14]. The model describes the
functional components that collectively fulfil the system's
preservation and access responsibilities, as shown in Fig 7.







Fig 7 OAIS model 
The functional components of an OAIS include:

> Ingest - services and functions that accept information
submitted by Producers and prepare it for storage and
management within the archive. In our case, Ingest
involves a web user interface through which scholars
(students, researchers and faculty) access the system
and submit their contents, which need to be organized
and stored in a database.

> Archival storage - manages the long-term storage and
maintenance of the digital materials entrusted to the
OAIS, to make sure they remain complete and renderable
over the long term. Media refreshment and format
migration, for example, are typical procedures
undertaken by the archival storage function.

> Data management - maintains descriptive metadata to
support search and retrieval of the archived content,
and the administration of internal operations. In our
proposed model, the content submitter is required to
briefly describe the metadata that will be used to
identify and retrieve the content from the database.

> Preservation planning - designs the preservation strategy
based on the evolving user and technology environment.

> Access - manages the processes and services that locate,
request and receive delivery of the content within the
archival store.

> Administration - responsible for day-to-day operations
and the co-ordination of the five other OAIS services.
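
As a rough illustration of how the Ingest, Archival storage, Data management and Access components listed above could fit together, the following minimal in-memory sketch stores submitted content and its descriptive metadata separately and serves searches from the metadata. The class, field and method names are hypothetical and are not part of the OAIS reference model or of the proposed system's design.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Submission:
    """Content and descriptive metadata supplied by a producer (scholar)."""
    title: str
    author: str
    field_of_study: str
    content: bytes


class InMemoryRepository:
    """Toy repository; a real system would use persistent, backed-up storage."""

    def __init__(self) -> None:
        self._storage: Dict[int, bytes] = {}              # archival storage
        self._metadata: Dict[int, Dict[str, str]] = {}    # data management
        self._next_id = 1

    def ingest(self, submission: Submission) -> int:
        """Ingest: validate the submission, then store content and metadata."""
        if not submission.title or not submission.content:
            raise ValueError("a submission needs a title and content")
        item_id = self._next_id
        self._next_id += 1
        self._storage[item_id] = submission.content
        self._metadata[item_id] = {
            "title": submission.title,
            "author": submission.author,
            "field_of_study": submission.field_of_study,
        }
        return item_id

    def search(self, keyword: str) -> List[int]:
        """Access: locate items whose metadata mentions the keyword."""
        needle = keyword.lower()
        return [
            item_id
            for item_id, meta in self._metadata.items()
            if any(needle in value.lower() for value in meta.values())
        ]

    def retrieve(self, item_id: int) -> bytes:
        """Access: deliver the stored content for a located item."""
        return self._storage[item_id]
```

Preservation planning and Administration are organizational functions and are deliberately left out of this sketch.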

Having identified the challenges facing HLIs in managing,
sharing and disseminating intellectual outputs, we propose a
new model of intellectual output sharing and dissemination
called the Integrated Digital Academic Repository (IDAR). The
model supports the management of HLIs' intellectual output in
a central database, whereby scholars from different
universities in Tanzania will be able to deposit materials in
the repository and access them. The proposed model will
facilitate the collection, archiving and dissemination of
intellectual outputs created in Tanzanian HLIs. The
intellectual outputs of researchers, students and faculty
members will be centrally collected, reviewed, archived and
disseminated to the scholarly community and the public. The
IDAR system enables researchers to communicate research
findings and to find out what has been done by other
researchers from their own institutions and from other
universities, in their own field and in other fields, in HLIs
of Tanzania. Scholars will have access to research results
and learning materials that can be used in academic or
research activities as a source of scientific information.
Faculty members of various universities will be able to share
academic materials that are useful for research and academic
purposes with scholars in HLIs and around the globe over the
internet, as shown in Fig 8.
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Fig 8 Proposed IDAR model 
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Fig 9 shows how the information flows from producers who for 
this case are students, researchers and staff belonging to a
particular HLI. Over the web, scholars from different
universities and research centres submit their intellectual
outputs (IOs), which are organized and put together in a
database (repository). Access is also initiated from the
established repository.





Fig 9 Conceptual IDAR Information Flow

A. User requirement specifications

This section presents the user perspectives and requirements
for the development of the proposed system, which were
gathered during data collection. According to [15],
requirements play a vital role and are a primary tool in the
development of any information system. They define what is to
be performed by a system and specify how it will be
performed. Requirements are categorized into two groups,
namely functional requirements and non-functional
requirements. Functional requirements describe the things,
actions, tasks and functions that the system is required to
perform, or the services the system should provide.
Non-functional requirements describe properties and system
constraints such as interface requirements, reliability,
performance, storage
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capacity, usability and system security. The non-functional 
requirements do not directly relate to the system
functionalities, though they describe how the system
functions should be performed [16]. For the proposed model to
be successfully implemented, the primary task was to identify
the requirements defining the functional specifications to be
incorporated into the design of the system. According to
[17], for any information system to be developed
successfully, requirements must be gathered from different
stakeholders and prospective users of the system. It has been
observed that when users are involved in the system
development process, particularly in requirement elicitation,
the possibility of developing a system that is usable and
highly acceptable is high [18].

Therefore, we described and elaborated the proposed model to
stakeholders (researchers, students and faculty), who in turn
gave their inputs defining the type of system needed and
specifying the system operations and the organization of the
contents that will be collected from scholars. Users also
specified the mode of access to be provided to each user, as
follows:

i. User Perspective towards Development of
Integrated Digital Academic Repository

Digital repositories already exist in some of the visited
institutions: MUHAS and SUA have repositories that collect
and disseminate the scholarly works of their institutions and
materials related to health and climate change, respectively.
Nevertheless, scholars of these universities, together with
scholars at NM-AIST, which did not have a digital repository
at the time, supported the development of the proposed
integrated digital academic repository, as it widens the
search area for materials. The findings show that 97% of
respondents supported the development of the proposed
repository, whereas 3% did not. Likewise, the results show
that 87% of respondents were extremely interested, 10% were
somewhat interested, 3% were neither interested nor
uninterested, and no one reported being uninterested. From
these statistics, we concluded that users need the proposed
system and that we needed to dig deeper to identify the
requirements or services that users expect the system to
provide.

ii. Type and format of the intellectual outputs

From the study, respondents identified different types of
intellectual outputs and different file formats to be
archived in and disseminated through the proposed repository.
Theses and dissertations, technical project reports, research
articles and learning materials (lectures, seminar
presentations) are the common items users asked to be
accommodated in the proposed system. Text files (doc, pdf),
multimedia (video, audio) and binary files were among the
file formats that users said need to be uploaded to and
accessed from the system, as shown in Fig 10.




Fig 10 Intellectual output



iii. Intellectual output submission and
organization

The majority (86%) of respondents showed interest in
accessing reliable information that has been verified by
experts as correct and useful; reviewers have therefore been
proposed to check the submitted contents before they are
uploaded to the system. The remaining 14% suggested that
submitted materials be uploaded directly by the corresponding
submitter (researcher, student, faculty). Users also suggested
how the submitted digital contents should be organized: 41% of
respondents asked for materials to be organized by field of
study, 26% by author's name and 32% by category (papers,
books, articles, presentations), while 1% did not specify.
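
The review-before-upload preference and the organization options reported above can be sketched as a small workflow: a submitted item becomes visible only after a reviewer approves it, and approved items can then be grouped by field of study, author or category. The names below are hypothetical and only illustrate one possible reading of these requirements.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, List


class Status(Enum):
    SUBMITTED = "submitted"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class Item:
    title: str
    author: str
    field_of_study: str
    category: str  # e.g. "paper", "book", "article", "presentation"
    status: Status = Status.SUBMITTED


def review(item: Item, approved: bool) -> Item:
    """Record the reviewer's decision; only submitted items can be reviewed."""
    if item.status is not Status.SUBMITTED:
        raise ValueError("item has already been reviewed")
    item.status = Status.APPROVED if approved else Status.REJECTED
    return item


def organize(items: Iterable[Item], key: str) -> Dict[str, List[str]]:
    """Group the titles of approved items by 'field_of_study', 'author' or 'category'."""
    if key not in {"field_of_study", "author", "category"}:
        raise ValueError(f"unsupported grouping key: {key}")
    groups: Dict[str, List[str]] = defaultdict(list)
    for item in items:
        if item.status is Status.APPROVED:
            groups[getattr(item, key)].append(item.title)
    return dict(groups)
```

Items that are still awaiting review, or that were rejected, never appear in the grouped listing.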

iv. Intellectual output sharing and
dissemination mode

The results show that 60% of respondents want the archived
intellectual outputs to be shared and disseminated to
scholars within the institution, outside the institution and
to the public, whereas 33% want the materials to be shared
among scholars within and outside the institution, 3% prefer
the materials to be shared only with scholars in the
institution, and 3% were not certain about which mode to use.
Not all materials produced need to be shared globally; some
are relevant only to a particular institution or group of
people, while others may be relevant to the whole world. We
considered all suggestions necessary and documented them for
inclusion in the further development stages. Roles have been
defined that allow a user to specify whether the submitted
materials should be available to everyone accessing the
system (public access), to institution members only
(institution access), or archived and accessed only by the
author or submitter of the work (private/individual access).
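
The three access modes described above lend themselves to a simple check at retrieval time. The sketch below is one possible interpretation, assuming each archived work records its owner and the owner's institution; the enum and function names are hypothetical.

```python
from enum import Enum
from typing import Optional


class AccessMode(Enum):
    PUBLIC = "public"            # everyone accessing the system
    INSTITUTION = "institution"  # members of the owner's institution
    PRIVATE = "private"          # only the author/submitter


def can_access(
    mode: AccessMode,
    owner: str,
    owner_institution: str,
    user: Optional[str] = None,
    user_institution: Optional[str] = None,
) -> bool:
    """Decide whether a (possibly anonymous) user may open an archived work."""
    if mode is AccessMode.PUBLIC:
        return True
    if mode is AccessMode.INSTITUTION:
        return user_institution is not None and user_institution == owner_institution
    # PRIVATE: only the submitter of the work may access it.
    return user is not None and user == owner
```

An anonymous visitor, for example, passes only the public check, while a colleague from the same institution also passes the institution check but not the private one.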

B. System Functions 

This section summarizes the functions to be performed by the
proposed system, following the requirements collected from
users.
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TABLE I. Summary of functional requirements 



No   Function
1    System must be able to register users (institutions and
     institution members)
2    System must allow user to submit and access contents
3    System must be able to identify system user using
     username and roles played by each user
4    System must provide a means through which submitted
     contents will be reviewed and edited before uploading
     for use
5    System must be able to generate reports for management
     on the status and trends of institution and user content
     contribution
6    The system should provide a way for content contributors
     to register themselves
7    System must be able to accommodate contents of various
     formats, i.e. text, multimedia and binary files
8    System must be able to send notification to users
     confirming their registration and status of their
     submission through emails



We used a use case diagram to present the functional
requirements gathered from users. A use case diagram is a
Unified Modelling Language (UML) diagram that provides a
pictorial representation of a system and of how users
interact with it [19]. The use case diagram depicts the
abstract view of the system, as shown in Fig 11.





Fig 11 Use Case for IDAR
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C. Non-functional requirements 

Because scholars need access to a large volume of materials
produced by different scholars from different institutions,
storage devices with high storage capacity are considered in
the proposed system. Because materials will be collected and
archived in the repository for the long term as a source of
scientific information for future studies, backup techniques
are necessary to allow easy recovery in case of a system
crash. A web-based interface will be provided for easy access
to materials. Intellectual outputs will be organized into
modules following the institution, school and department to
which a submitter belongs, to minimize searching and
retrieval time. To ensure the reliability and usability of
the archived contents, peer reviewing and editing are
considered necessary to confirm the correctness and
usefulness of the archived materials. Moreover, user IDs and
roles have been defined to serve as a security mechanism in
the proposed system.

D. Intellectual Property Rights (IPR)

From the literature, intellectual property rights (IPRs) such
as copyright and licenses have been identified as the biggest
roadblock to self-archiving, limiting the number of
intellectual outputs populated into a repository. Because
intellectual outputs embody the innovation, creativity and
skills of individuals, who in turn want to be recognized and
to possess and control their scholarly works, the materials
are often copyrighted or licensed to a particular person or
group of people. Access to an item covered by particular IPRs
depends on the terms and conditions of the law protecting it.
IPRs are rights granted to creators and owners for their
intellectual creativity in the industrial, scientific,
literary and artistic domains. The work can be in the form of
an invention, a manuscript, a suite of software or a business
name. The rights were introduced intentionally to protect
content creators' rights while at the same time allowing the
public to access their creations [20]. It has been observed
that scholars sacrifice invaluable rights when handing their
contents over to publishers. However, for self-archiving of
copyrighted contents, the law does not state directly what
should be done; instead, it leaves the decision to the
parties involved in the publishing agreement. For example,
the publisher may allow authors to self-archive a copy of
their published work in their repository, but not for
commercial gain. Work published in Open Access (OA) journals
is free material that can be accessed by anyone (termed green
here). Materials termed gold here require a subscription or
pay-per-view fee, whereas green materials and unpublished
materials have no restriction on user access. Thus, for our
proposed system, we suggest archiving both published and
unpublished materials while preserving the IPRs of authors.
The gold published materials will be archived with the main
intent of alerting users to their presence; therefore, the
title and abstract metadata of the work will be provided as a
pointer for further inquiries.
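
The archiving rule suggested above can be expressed as a small function: works labelled gold in this section are archived as metadata only (title and abstract), while green and unpublished works are archived with their full text. The labels follow this section's own usage, and the data structure and function name are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Labels as used in this section: "gold" materials sit behind a
# subscription or pay-per-view fee; "green" and "unpublished" do not.
ACCESS_CLASSES = {"green", "gold", "unpublished"}


@dataclass
class ScholarlyWork:
    title: str
    abstract: str
    full_text: str
    access_class: str


def archive_record(work: ScholarlyWork) -> Dict[str, Optional[str]]:
    """Build the record stored in the repository for a given work."""
    if work.access_class not in ACCESS_CLASSES:
        raise ValueError(f"unknown access class: {work.access_class}")
    record: Dict[str, Optional[str]] = {
        "title": work.title,
        "abstract": work.abstract,
        "full_text": None,  # metadata-only pointer by default
    }
    if work.access_class != "gold":
        # No access restriction applies, so the full text can be archived.
        record["full_text"] = work.full_text
    return record
```

A gold record therefore acts only as a pointer that alerts users to the work's existence, which keeps the publisher's access conditions intact.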






http://sites.google.com/site/ijcsis/ 
ISSN 1947-5500 



(IJCSIS) International Journal of Computer Science and Information Security, 
Vol. 12, No. 9, September 2014 



V. Conclusion and Future Work 

The current situation of intellectual output sharing and
dissemination in Higher Learning Institutions of Tanzania has
been identified and presented in this paper. The challenges
and weaknesses facing the sharing of and access to academic
and research works produced in Higher Learning Institutions
of Tanzania have been identified. It has been observed that
HLIs produce a large number of digital contents, including
research articles, theses, technical project reports,
conference proceedings and learning materials. However,
access to and availability of these materials remain a
problem for the scholarly community and the world.
Universities lack a central system to collect, organize,
manage and share their intellectual outputs. Materials are
scattered: some are published in international journals to
which access is limited by subscription and pay-per-view
fees, and some are housed on individuals' hard drives or left
unpublished on university library shelves, where access is
granted to a limited number of people around the institution
or to none.

To address the identified challenges, a new intellectual
output sharing model that enhances the organization, access,
sharing and dissemination of intellectual outputs created in the
Higher Learning Institutions of Tanzania has been
proposed and presented in this paper. The proposed model
provides a platform that manages and disseminates created
outputs centrally. Functional and non-functional
requirements of the proposed system have been identified and
summarized. Using the proposed model and the identified user
requirements, the authors consider the design and development of an
information system that exhibits the identified characteristics to
be their future work.


Abstract — Fuzzy association rules are rules of the form "If X
is A then Y is B", where X and Y are sets of attributes and A, B
are fuzzy sets that describe X and Y respectively. In most
fuzzy association rule mining problems, the fuzziness is specified
by users. Users usually specify the fuzziness based on their
understanding of the problem and on their ability to express
the fuzziness in natural language. However, some fuzziness
cannot be expressed in natural language because of its
limitations. In this paper we propose a method of
extracting fuzzy association rules that cannot be found by the
usual methods.

Keywords- Fuzzy set, Association rules, Fuzzy interval, 
Certainty factor, Significance factor, Between Operation. 

I. Introduction 

The problem of association rule mining was defined by
Agrawal et al. [4]. Binary association rule mining finds the
relationships between the presence of various items within
the baskets. A generalization of binary association rules is
motivated by the fact that a dataset is usually not restricted to
binary attributes but also contains attributes with values
ranging over ordered scales, such as cardinal or ordinal
attributes. Quantitative association rules were defined for
dealing with quantitative attributes [5]. In quantitative
association rules, attribute values are specified by means of
subsets, which are typically intervals specified by hard
boundaries. This is done by discretizing the domains of
quantitative attributes into intervals. Generalizing from hard
boundary intervals to soft boundary intervals has given rise to
fuzzy association rules. A method for computing fuzzy
association rules has been described in [1]. Fuzzy
association rules are more understandable to humans because of
the linguistic terms associated with fuzzy sets. The known fuzzy
association rule mining techniques may, however, miss some
interesting rules in the process, as will be shown here. In this
paper, we propose a method that can extract these missing
rules.

The paper is organized as follows. Section II briefly discusses
related work. Section III reviews definitions of basic terms and
describes the notation and symbols generally used in association
rule mining. Section IV discusses the problem that may arise with
the existing method and then describes how to extract the missing
rules. Finally, Section V provides a conclusion and lines for
future research.

II. RELATED WORKS 

Replacing crisp sets (intervals) by fuzzy sets (intervals) leads
to fuzzy (quantitative) association rules. Thus, a fuzzy
association rule is understood as a rule of the form $A \rightarrow B$,
where $A$ and $B$ are now fuzzy subsets rather than crisp subsets
of the domains $D_X$ and $D_Y$ of two attributes $X$ and $Y$
respectively. Each attribute is associated with several
fuzzy sets. In other words, an attribute $X$ is now replaced by a
number of fuzzy attributes rather than by a number of binary
attributes. Each element contributes a vote between 0 and 1,
both inclusive, to the fuzzy attributes.

The approach taken in [1], [2], [6] to generalize the
support-confidence measure to fuzzy association rules is to replace
the set-theoretic operations, namely Cartesian product and
cardinality, by the corresponding fuzzy set-theoretic operations. In
[1] the terms significance and certainty are used instead of
support and confidence, which are usually used in non-fuzzy
situations.



III. TERMS AND NOTATIONS USED 

A. Some basic definitions, terms and notations related to
fuzziness

Let $E$ be the universe of discourse. A fuzzy set $A$ in $E$ is
characterized by a membership function $A(x)$ lying in $[0, 1]$.
$A(x)$ for $x \in E$ represents the grade of membership of $x$ in $A$.
Thus a fuzzy set $A$ is defined as

$$A = \{(x, A(x)),\ x \in E\}$$

Fuzzy intervals are special fuzzy numbers satisfying the
following:

1. there exists an interval $[a, b] \subset \mathbb{R}$ such that
$A(x_0) = 1$ for all $x_0 \in [a, b]$, and

2. $A(x)$ is piecewise continuous.

A fuzzy interval can be thought of as a fuzzy number with a
flat region. A fuzzy interval $A$ is denoted by $A = [a, b, c, d]$
with $a < b < c < d$, where $A(a) = A(d) = 0$ and $A(x) = 1$ for all
$x \in [b, c]$. $A(x)$ for $x \in [a, b]$ is known as the left reference
function and $A(x)$ for $x \in [c, d]$ is known as the right reference
function. The left reference function is non-decreasing and the
right reference function is non-increasing (see e.g. [3]).
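
The definition of a fuzzy interval $[a, b, c, d]$ can be made concrete with a small sketch. The text only requires the reference functions to be monotone; the linear (trapezoidal) shape used below, the function name `fuzzy_interval` and the hour values are assumptions made for illustration.

```python
def fuzzy_interval(a, b, c, d):
    """Return the membership function of the fuzzy interval [a, b, c, d].

    Assumption (not stated in the text): both reference functions are linear.
    """
    if not (a < b < c < d):
        raise ValueError("expected a < b < c < d")

    def membership(x):
        if b <= x <= c:              # flat region: full membership
            return 1.0
        if a < x < b:                # left reference function, non-decreasing
            return (x - a) / (b - a)
        if c < x < d:                # right reference function, non-increasing
            return (d - x) / (d - c)
        return 0.0                   # outside (a, d), including A(a) = A(d) = 0

    return membership


# Hypothetical "morning" interval on an hours-of-the-day axis.
morning = fuzzy_interval(3, 6, 11, 13)
print(morning(4.5), morning(8), morning(12))
```

Any other monotone pair of reference functions could be substituted without changing the rest of the definitions.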

B. Some basic definitions related to association rules

Consider a set $I = \{i_1, i_2, \ldots, i_m\}$ of items, and let a
transaction $t$ (data record) be a subset of $I$, i.e. $t \subseteq I$.
Let $D_X = \{t \in D \mid X \subseteq t\}$ denote the set of
transactions in the database $D$ that contain the items
$X \subseteq I$. The cardinality of this set, i.e. $|D_X|$, is called
the support of $X$ in $D$. Given a minimum threshold $\sigma$, $X$ is
said to be frequent if $|D_X| > \sigma$. An association rule is a rule
of the form $A \rightarrow B$ where $A, B \subseteq I$ and
$|D_{A \cup B}| / |D_A| > p$, where $p$ is another user-defined
threshold. The support of an association rule $A \rightarrow B$ is
$|D_{A \cup B}|$. Sometimes the support is calculated as a fraction of
the size of the dataset under consideration. In that case we have
$\mathrm{supp}(A \rightarrow B) = |D_{A \cup B}| / |D|$. The
confidence is the proportion of correct applications of the rule:

$$\mathrm{conf}(A \rightarrow B) = |D_{A \cup B}| / |D_A|$$

Rather than looking at a transaction $t$ as a subset of items, it
can also be seen as a sequence $(x_1, x_2, \ldots, x_m)$ of values of
binary variables $X_j$ with domain $D_{X_j} = \{0, 1\}$, where
$x_j = 1$ if the $j$th item, $i_j$, is contained in $t$, and
$x_j = 0$ otherwise.
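
As a minimal sketch of the support and confidence definitions above, the following counts transactions directly. The function names and the toy basket data are hypothetical and serve only to illustrate the formulas.

```python
def support(transactions, items):
    """|D_X|: number of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t))


def confidence(transactions, antecedent, consequent):
    """conf(A -> B) = |D_{A u B}| / |D_A| (0.0 if A never occurs)."""
    denom = support(transactions, antecedent)
    if denom == 0:
        return 0.0
    joint = support(transactions, set(antecedent) | set(consequent))
    return joint / denom


# Hypothetical market-basket database D.
D = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

abs_support = support(D, {"bread", "milk"})          # |D_{A u B}|
rel_support = abs_support / len(D)                   # as a fraction of |D|
conf_bread_milk = confidence(D, {"bread"}, {"milk"})
print(abs_support, rel_support, conf_bread_milk)
```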

The association rule mining problem has been extended to
handle relational tables rather than transactions of items. In
this case the problem is transformed into a binary one in the
usual way. However, a database may contain quantitative
attributes (such as age or salary), and in such cases transforming
it into a binary one is not possible because of the large size of
the underlying domain, e.g. the integers. The discrete interval
method [5] divides the quantitative attribute domain into
discrete intervals. Each element contributes support to its
own interval. In fact, each interval $A = [x_1, x_2]$ again
defines a binary attribute $X_A$, given by $X_A(x) = 1$ if $x \in A$
and $0$ otherwise. In other words, each quantitative attribute $X$
is replaced by $k$ binary attributes $X_{A_i}$ such that
$D_X \subseteq \bigcup_{i=1}^{k} A_i$.
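
The discrete interval encoding can be sketched as follows. The interval boundaries for an age attribute and the helper name `interval_attributes` are assumptions chosen for illustration; the cited method [5] may partition the domain differently.

```python
def interval_attributes(value, intervals):
    """Encode a quantitative value as one binary attribute per interval.

    X_A(x) = 1 if x lies in the closed interval A = [lo, hi], else 0.
    """
    return [1 if lo <= value <= hi else 0 for lo, hi in intervals]


# Hypothetical partition of an integer "age" domain into k = 4 intervals.
age_intervals = [(0, 17), (18, 39), (40, 64), (65, 120)]

encoded = interval_attributes(30, age_intervals)
print(encoded)
```

Because the intervals have hard boundaries, a value such as 39 votes fully for one interval and not at all for its neighbour, which is the behaviour the fuzzy generalization relaxes.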

C. Significance factor 

The significance factor is calculated by first summing up the
votes of all records with respect to the specified item set and
then dividing the sum by the total number of records. Let $A$ be a
set of fuzzy sets defined on a set of attributes $X$; then the
significance factor of the pair $\langle X, A \rangle$ is calculated as

$$\mathrm{Significance}\langle X, A \rangle = \frac{\sum_{t_i \in D} \prod_{x_j \in X} \alpha_{a_j}(t_i[x_j])}{|D|}$$

where $t_i$ is the $i$-th transaction, $t_i[x_j]$ gives the value of the
$j$-th attribute in $t_i$, $m_{a_j}$ is the membership function of $x_j$,
and

$$\alpha_{a_j}(t_i[x_j]) = \begin{cases} m_{a_j}(t_i[x_j]) & \text{if } m_{a_j}(t_i[x_j]) > \omega \\ 0 & \text{otherwise} \end{cases}$$

where $\omega$ is a user-specified threshold.
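
A minimal sketch of the significance factor follows, assuming (as in the method of [1]) that the vote of a record is the product of its thresholded memberships over the attributes in X. The membership functions `young` and `high`, the records and the threshold value are hypothetical.

```python
from math import prod


def significance_factor(records, fuzzy_sets, omega):
    """Average vote of the records for the pair <X, A>.

    records    : list of dicts mapping attribute name -> value
    fuzzy_sets : dict mapping attribute name -> membership function
    omega      : memberships not exceeding omega are counted as 0
    """
    if not records:
        return 0.0
    total = 0.0
    for record in records:
        votes = []
        for attribute, membership in fuzzy_sets.items():
            m = membership(record[attribute])
            votes.append(m if m > omega else 0.0)
        total += prod(votes)          # vote of this record
    return total / len(records)


# Hypothetical fuzzy sets "young" (on age) and "high" (on salary).
def young(age):
    return max(0.0, min(1.0, (40 - age) / 20))


def high(salary):
    return max(0.0, min(1.0, (salary - 20) / 40))


records = [
    {"age": 20, "salary": 60},
    {"age": 30, "salary": 40},
    {"age": 30, "salary": 30},
]
sig = significance_factor(records, {"age": young, "salary": high}, omega=0.3)
print(sig)
```

The third record has membership 0.25 in `high`, which does not exceed the threshold, so its vote is dropped entirely rather than counted as a small contribution.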

D. Significance of a fuzzy association rule

The significance of a fuzzy association rule $A \rightarrow B$, where
$A$ and $B$ are fuzzy subsets, is defined in [1] as the ratio of the
sum of the combined memberships $A(x) \otimes B(y)$ over all
$(x, y) \in D$ to the total number of transactions in $D$, i.e.

$$\mathrm{significance}(A \rightarrow B) = \frac{\sum_{(x,y) \in D} A(x) \otimes B(y)}{|D|}$$

where $\otimes$ is the multiplication operator.

E. Certainty of a fuzzy association rule

The certainty of a fuzzy association rule $A \rightarrow B$ is defined as

$$\mathrm{certainty}(A \rightarrow B) = \frac{\sum_{(x,y) \in D} A(x) \otimes B(y)}{\sum_{(x,y) \in D} A(x)}$$

F. Definition

A fuzzy association rule $A \rightarrow B$ is said to hold if
certainty($A \rightarrow B$) is greater than or equal to min_cert and
significance($A \rightarrow B$) is greater than or equal to min_sig,
where the thresholds min_cert and min_sig are provided by the user.
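
The two measures and the definition of when a rule holds can be sketched together. The membership functions, the data pairs and the threshold values below are hypothetical; only the formulas come from the text.

```python
def rule_measures(pairs, A, B):
    """Return (significance, certainty) of the fuzzy rule A -> B.

    pairs : list of (x, y) attribute values, one pair per transaction in D
    A, B  : membership functions of the antecedent and consequent fuzzy sets
    """
    joint = sum(A(x) * B(y) for x, y in pairs)       # sum of A(x) (x) B(y)
    antecedent = sum(A(x) for x, _ in pairs)         # sum of A(x)
    significance = joint / len(pairs) if pairs else 0.0
    certainty = joint / antecedent if antecedent > 0 else 0.0
    return significance, certainty


def holds(pairs, A, B, min_sig, min_cert):
    """A -> B holds if both measures reach their user-supplied thresholds."""
    significance, certainty = rule_measures(pairs, A, B)
    return significance >= min_sig and certainty >= min_cert


# Hypothetical fuzzy sets "young" (age) and "high" (salary).
def young(age):
    return max(0.0, min(1.0, (40 - age) / 20))


def high(salary):
    return max(0.0, min(1.0, (salary - 20) / 40))


pairs = [(20, 60), (30, 40), (30, 30)]
sig, cert = rule_measures(pairs, young, high)
print(sig, cert, holds(pairs, young, high, min_sig=0.4, min_cert=0.6))
```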

IV. LIMITATION OF EXISTING METHOD AND NEW APPROACH 

In this section we show how some interesting fuzzy 
association rules may be missed out due to the way in which a 
user has specified the input fuzzy sets. For example let us 
consider fuzzy sets defined on an attribute specifying the time 
in hours at which an event has occurred. The fuzzy sets might 
be mid-night, morning, afternoon, evening and night as 
described by the following figure.

[Figure: degree of membership (vertical axis) against time of day
(horizontal axis) for the fuzzy sets Mid-night, Morning, Afternoon,
Evening and Night.]

Fig. Definition of linguistic terms for the attribute
Time-of-occurrence.

Consider the two consecutive fuzzy intervals mid-night and
morning. The points that contribute to both intervals, i.e. the
transactions between 3 a.m. and 6 a.m., contribute much less
than 1 to each of them. So if a reasonable number of events take
place in this period then, because of the manner in which supports
are calculated, their contributions will be small in both
intervals. Thus some of the rules may be left undiscovered. This
may also happen with non-fuzzy intervals. For example, if all the
transactions are uniformly distributed in the interval [1, 4] and
the user-specified intervals are [0, 2] and [2, 4], then it is not
possible to identify [1, 4] as a frequent interval, although it
might have been frequent had [1, 4] itself been considered as an
interval. If we consider the fuzzy interval that lies between
Midnight and Morning (the between operation is defined in
Section IV-A), this interval may turn out to be frequent.

The known methods are highly user dependent because the fuzzy
intervals are supplied by a domain expert. The domain expert may
not have sufficient knowledge of the dataset, and will therefore
supply fuzzy intervals according to a limited understanding of the
dataset, restricted to fuzzy intervals that can be expressed using
linguistic terms. So there is every possibility that some of the
association rules will be left undiscovered. In effect, the time
stamps lying between the fuzzy intervals are given less emphasis,
which may not always be appropriate.

A. Between operation

Let $A$ and $B$ be two adjacent fuzzy intervals specified by the user.
They are called adjacent if $A(x) \cap B(x) \neq \Phi$ for some
$x \in E$ and there is no user-specified fuzzy interval in between.
We define an operation called "Between", which takes $A$ and $B$ as
input and returns a fuzzy interval covering the portion between $A$
and $B$ with membership value less than 1. To illustrate this, let
$A = [a_1, b_1, c_1, d_1]$ and $B = [a_2, b_2, c_2, d_2]$ be two
adjacent fuzzy intervals.

[Figure: the two adjacent fuzzy intervals A and B; the image is not
recoverable from this copy.]

Here ROU is the portion neither belonging to $A$ with full
membership nor belonging to $B$ with full membership. Our
between operation takes the above two intervals as input and
returns a fuzzy interval with core RU. We denote it by
$A(b)B = [b_1, c_1, b_2, c_2]$, where

$$(A(b)B)(x) = \begin{cases} (x - b_1)/(c_1 - b_1), & b_1 < x < c_1 \\ 1, & c_1 < x < b_2 \\ (c_2 - x)/(c_2 - b_2), & b_2 < x < c_2 \end{cases}$$
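
The Between operation can be sketched as below. Two choices are assumptions rather than part of the definition: the boundary points $c_1$ and $b_2$ are assigned membership 1 (where the adjoining pieces agree), and the membership is taken to be 0 outside $[b_1, c_2]$. The hour values for the two input intervals are hypothetical.

```python
def between(A, B):
    """A(b)B = [b1, c1, b2, c2] for A = (a1, b1, c1, d1), B = (a2, b2, c2, d2)."""
    _, b1, c1, _ = A
    _, b2, c2, _ = B
    if not (b1 < c1 <= b2 < c2):
        raise ValueError("expected b1 < c1 <= b2 < c2")
    return (b1, c1, b2, c2)


def between_membership(A, B):
    """Membership function of A(b)B, following the piecewise definition."""
    b1, c1, b2, c2 = between(A, B)

    def mu(x):
        if b1 <= x < c1:
            return (x - b1) / (c1 - b1)   # rising part over the core of A
        if c1 <= x <= b2:
            return 1.0                    # region between the two cores
        if b2 < x <= c2:
            return (c2 - x) / (c2 - b2)   # falling part over the core of B
        return 0.0                        # assumption: zero elsewhere

    return mu


# Hypothetical "mid-night" and "morning" intervals on an hours axis.
midnight = (0, 1, 3, 6)
morning = (3, 6, 11, 13)
mu = between_membership(midnight, morning)
print(between(midnight, morning), mu(2), mu(4.5), mu(8.5))
```

With these values, an event at 4:30 a.m. receives full membership in the Between interval even though it has only partial membership in each input interval.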




Given an underlying data set and fuzzy sets as input, our
method considers the fuzzy sets formed by joining
consecutive fuzzy intervals as discussed above, together with
the input fuzzy intervals, when calculating the significance
factor and finally finding the association rules. In this process
every time stamp in the whole duration under
consideration is given equal importance, and every such time
stamp contributes 1 to at least one fuzzy interval under
consideration. As a result, the rules found with the user-specified
intervals are retained, and associations concentrated in the
regions between them can also be detected.
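
The effect of the method can be illustrated with a small sketch that adds the Between interval of each consecutive pair to the user-specified intervals and compares the resulting significance values. Linear reference functions, the hour values, the event times and the helper names are all assumptions made for this example.

```python
def trapezoid(a, b, c, d):
    """Membership function of [a, b, c, d], assuming linear reference functions."""
    def mu(x):
        if b <= x <= c:
            return 1.0
        if a < x < b:
            return (x - a) / (b - a)
        if c < x < d:
            return (d - x) / (d - c)
        return 0.0
    return mu


def with_between(intervals):
    """Interleave the input intervals with the Between interval of each pair."""
    extended = []
    for left, right in zip(intervals, intervals[1:]):
        extended.append(left)
        # A(b)B = [b1, c1, b2, c2]
        extended.append((left[1], left[2], right[1], right[2]))
    extended.extend(intervals[-1:])
    return extended


def significance(values, interval):
    """Average membership of the values in one fuzzy interval."""
    mu = trapezoid(*interval)
    return sum(mu(v) for v in values) / len(values)


midnight = (0, 1, 3, 6)       # hypothetical user-specified intervals
morning = (3, 6, 11, 13)
events = [3.5, 4.0, 4.5, 5.0, 5.5]   # events between 3 a.m. and 6 a.m.

candidates = with_between([midnight, morning])
scores = [significance(events, c) for c in candidates]
print(list(zip(candidates, scores)))
```

With a minimum significance of, say, 0.6, neither input interval is frequent for these events, while the added interval is.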



V. SUMMARY AND LINES FOR FUTURE WORK

An approach to finding fuzzy association rules that may be
missed by other existing methods is discussed here. The
algorithm extracts all the fuzzy association rules extracted by
other methods and possibly some more. The other methods do
not consider the region lying between two consecutive fuzzy
intervals specified by the user. If a record falls in this region,
its contribution is small in both fuzzy intervals. Even if a
sufficient number of records lie in this region, the sum of
their contributions may still be less than the user-specified
threshold, and hence some association rules associated with
such regions will be left undiscovered. In this paper we
proposed an approach that takes into consideration all such
regions falling between two consecutive fuzzy intervals
specified by the user. The method therefore gives all the
fuzzy association rules over the user-specified intervals and
also extracts some extra association rules that lie beyond the
scope of the user's specification. Future work may proceed along
the following two lines:

• attempting to find rules for fuzzy data;

• instead of taking the fuzzy intervals as input,
attempting to extract the fuzzy
intervals from the dataset in a natural way.


Abstract — In a traditional non-virtualized computer system the
whole software stack is highly vulnerable to security breaches.
This is mainly because deployed security systems coexist in the
same space as the potentially compromised operating system and
applications, which often run with administrative privileges. In
such a structure, compromising, bypassing, disabling, or even
subverting deployed security systems becomes trivial. Machine
virtualization provides a powerful abstraction for addressing
information security issues. Its isolation, encapsulation, and
partitioning properties can be leveraged to reduce computer
systems' susceptibility to security breaches. This paper
demonstrates that machine virtualization, when synthesized with
cryptography, can preserve information confidentiality even on an
untrusted machine. It presents a novel information security
approach called Virtualized Anti-Information Leakage (VAIL). Its
objective is to thwart information leakage attacks by malicious
software and insiders on sensitive files after decryption in
potentially compromised computer systems. VAIL's defenses are
evaluated against a variety of information leakage attacks
including: (1) direct attacks launched on sensitive files from an
untrusted virtual machine and from a compromised virtual machine
monitor; and (2) indirect attacks exploiting covert storage and
timing channels. Based on the security evaluation, it is concluded
that VAIL complies with the security requirements and meets its
objective.

Index Terms — Information Security; Information Leakage; 
Machine Virtualization; Malicious Software; Insider Threat 

I. INTRODUCTION 

Information leakage attacks represent a serious threat because of
their widespread and devastating effects. The significance of such
attacks stems from the fact that they are committed by an
organization's authorized computer users and/or by processes
executing on their behalf. The diverse avenues that can be
exploited to carry out these attacks make them even harder to
address.

This paper focuses on malicious software (malware) and the insider
threat, as they are the most prominent perpetrators of information
leakage attacks. Malware continues to be a major threat, whilst the
insider threat is prevalent and remains a challenging unsolved
problem for two main reasons: (1) insiders possess a deep
understanding of the targeted vulnerable processes; and (2) they
are aware of systems' unpatched security vulnerabilities.
Consequently, addressing malware and the insider threat is a key
security requirement.

To highlight the problem area, the following example is
presented. An accountant creates a spreadsheet file to maintain
the company's bank account number, balance, total credits and
debits, etc. He/she regularly downloads renewal statements and
statements of account from the bank's website and edits the
spreadsheet file. To prevent unauthorized disclosure and
propagation of such sensitive financial information, the company's
security policy mandates encrypting sensitive files. However,
after decryption, sensitive files are still exposed to information
leakage attacks. New undetected malware may attempt to leak the
file's contents after capturing its decryption password and/or
opening it. In addition, being authorized users, the accountant or
any of his/her co-workers may exploit their privileges to leak such
sensitive information to the company's competitors for personal or
financial gain.

This paper presents a novel information security approach
called Virtualized Anti-Information Leakage (VAIL). Its
objective is to thwart information leakage attacks by malware and
insiders on sensitive files after decryption in potentially
compromised computer systems. VAIL's basic idea lies in the
way machine virtualization and cryptography are synthesized
and employed to achieve this objective. VAIL's defenses
are evaluated against a variety of information leakage attacks
including: (1) direct attacks launched on sensitive files from an
untrusted virtual machine and from a compromised virtual machine
monitor; and (2) indirect attacks exploiting covert storage and
timing channels.

The remainder of this paper is organized as follows: Section
II briefly explains machine virtualization and its
security-related advantages. Section III provides an overview of
previous work that exploited machine virtualization in
information security applications. Section IV presents VAIL: its
security requirements, threat model and assumptions,
evaluation of design alternatives, structure and components,
encryption scheme, and operation.
Section V evaluates VAIL's defenses against direct and indirect
information leakage attacks. Finally, Section VI concludes the
paper.

II. MACHINE VIRTUALIZATION

Machine virtualization is a computer system abstraction that
aims at detaching workloads (i.e., operating systems (OSs)
and applications) and data from the functional side of the
physical hardware [29]. Through machine virtualization,
multiple isolated guests, called virtual machines (VMs), with
heterogeneous unmodified OSs run concurrently on top of a
virtual machine monitor (VMM), which resides directly above
the host hardware (Figure 1). A VM behaves like a separate
computer whose virtual resources are a subset of the
machine's physical resources.



[Figure: VM 1, VM 2, ..., VM n, each running applications (Apps) on
its own guest OS, stacked above the VMM, which sits directly above
the host hardware.]

Fig. 1. Machine virtualization structure

The VMM manages and mediates access to physical resources, and
creates more instances of a physical resource than exist in
reality. It presents to each guest a picture of the resource that
corresponds to its own context. It provides the necessary mapping
between the physical resources and the VMs' virtual devices. It
intercepts guests' privileged instructions on virtual devices
and handles them before they are executed by the physical
hardware under its control. Being at the highest privilege level,
the VMM is able to isolate itself from the VMs and isolate
the VMs from each other.



A. Security-Related Advantages 

The isolation, encapsulation, minimal code size, and partitioning
properties of machine virtualization can be leveraged to
reduce computer systems' susceptibility to security breaches, as
clarified below.

- Minimized Attack Surface through Isolation

Machine virtualization provides strong isolation between the
VMM and its VMs, and between the VMs. This confines the
attack surface to the potentially compromised
VM(s), and prevents adversaries from expanding their
attacks to adjacent VMs sharing the same host hardware.
Furthermore, the isolation property can be exploited to
prevent adversaries from disabling or even subverting the
desired security functionality. Security applications can
be deployed in dedicated VM(s) that have privileged access
to other potentially compromised VMs and to the physical
hardware resources through the VMM.



- VM Secure State Restoration through Encapsulation 

The VMM can encapsulate the execution state of a VM and
resume the execution of a pre-configured VM image. By
leveraging the encapsulation property, security administrators
can instantly restore a secure VM image by rolling a
compromised VM back to some previously checkpointed secure
state and then resuming the VM normally. Such a capability
facilitates systems administration and simplifies the lengthy,
complicated setup procedure needed for a physical server.

- Maximized Security through Thin VMMs 

OSs are very complex and large. Their sizes fall in the range
of tens of millions of lines of code (LOC), which makes it
inappropriate to consider them secure [3]. Regarding VMMs,
Xen [31], for instance, is implemented in under 50,000 LOC,
whereas the seL4 microkernel [18] has 8,700 lines of C code and
600 lines of assembler. Other examples include BitVisor [28],
which comes to 21,582 LOC, and SecVisor [27], which comes to
1,739 and 1,112 LOC for its two versions. The number
of security vulnerabilities tends to grow with code
size. Vulnerability reports are consistent with this, showing, for
instance, only 72 security vulnerabilities for Xen 4.x [25]
compared with 310 for Microsoft Windows 7 [26].

- Enhanced VM Isolation through Static Partitioning of 
Resources 

The VMM has the property of partitioning the system 
resources among its VMs. Partitioning could be accomplished 
statically or dynamically. In dynamic partitioning, resources 
are allocated to and de-allocated from VMs as needed during 
their execution. In static partitioning, each VM is provided 
access only to its fixed allocated resources, thereby providing 
strengthened isolation between adjacent VMs sharing the 
same host hardware [29]. 



III. RELATED WORK 

Information security researchers have presented a number of
applications that focus mainly on leveraging the isolation
and encapsulation properties of machine virtualization. These
applications can be categorized into: (1) malware analysis,
detection, and prevention; (2) intrusion analysis and detection;
and (3) digital forensics, as follows.

A. Malware Analysis, Detection, and Prevention 

MAVMM [23] is a lightweight, special-purpose VMM
for malware analysis. MAVMM supports a single guest OS
and makes use of hardware-assisted virtualization to minimize
the VMM's code base and make it harder for malware to detect
the VMM's existence. It is capable of extracting many
features from programs running inside the guest OS, such as
fine-grained execution traces, memory pages, system calls, and
disk and network accesses.






VMWatcher [16] uses VM introspection (VMI) to detect
rootkits. It captures the current state of a guest OS and
compares it with the state reported by the guest OS itself to
detect any inconsistencies in the data structures of currently
executing programs. Patagonix [21] relies on a monitor
implemented in the VMM, and a separate VM, in order to detect
binaries covertly executing in a VM. As input, Patagonix
requires information about the binaries it is to identify, in the
form of a list of known binaries. Any binaries
not in the list are identified as unknown. The results of the
processes' identity checking are sent to the user, who can
compare Patagonix's report on currently executing binaries
with those reported by the guest OS.
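
The cross-view comparison that both tools rely on reduces, conceptually, to a difference between two views of the same VM. The sketch below is a toy illustration only; the function name and process names are hypothetical and do not correspond to the interfaces of VMWatcher or Patagonix.

```python
def cross_view_diff(vmm_view, guest_view):
    """Compare the processes seen from the VMM with those the guest OS reports."""
    vmm_view, guest_view = set(vmm_view), set(guest_view)
    return {
        # running according to the VMM but not reported by the guest OS
        "hidden_from_guest": sorted(vmm_view - guest_view),
        # reported by the guest OS but not observed from the VMM
        "missing_from_vmm": sorted(guest_view - vmm_view),
    }


# Hypothetical process lists.
seen_by_vmm = ["init", "sshd", "bash", "kworker_x"]
reported_by_guest = ["init", "sshd", "bash"]

report = cross_view_diff(seen_by_vmm, reported_by_guest)
print(report)
```

A non-empty `hidden_from_guest` list is the kind of inconsistency that would prompt further investigation for a rootkit.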

SecVisor [27] and NICKLE [24] are VMM-based systems
that aim at preventing rootkits from executing. SecVisor is
a tiny special-purpose VMM that supports only a single
central processing unit (CPU) core and a single VM running
a commodity OS. It leverages hardware-assisted virtualization
to write-protect kernel code pages in the guest OS's memory,
and approves the loading of kernel modules only if their
cryptographic hashes have been pre-incorporated in a whitelist.
NICKLE is a trusted VMM; it maintains a separate
guest-inaccessible shadow memory for a running VM to store
authenticated kernel code. At runtime, NICKLE transparently
redirects guest kernel instruction fetches to the shadow
memory. As a result, only authenticated kernel code is fetched
for execution, and rootkits that normally require executing
their own attack code will not be executed.
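
The whitelist check described for kernel modules can be sketched in a few lines. The text says only that cryptographic hashes are used; SHA-256, the function names and the byte strings below are assumptions for illustration and are not SecVisor's actual implementation.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex digest of the module image (SHA-256 is an assumed choice)."""
    return hashlib.sha256(data).hexdigest()


def approve_module(module_bytes: bytes, whitelist: set) -> bool:
    """Approve loading only if the module's hash was pre-incorporated."""
    return sha256_hex(module_bytes) in whitelist


# Hypothetical module images.
trusted_module = b"trusted kernel module image"
whitelist = {sha256_hex(trusted_module)}

print(approve_module(trusted_module, whitelist))
print(approve_module(b"rootkit module image", whitelist))
```

Because any change to the module bytes changes the digest, a tampered copy of an approved module is rejected in the same way as an unknown one.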

B. Intrusion Analysis and Detection 

Intrusions can be analyzed by logging and replaying VM
execution. If a VM has been subject to an intrusion, the logged
records can be used to analyze the break-in process by
replaying the VM's instructions. This can help in determining
the source, cause, and effects of the attack. ReVirt [5] is a
VMM-based system that adopts such an approach. As an
advancement over ReVirt, SMP-ReVirt [6] brings ReVirt's logging
and replaying functionality to multiprocessor VMs running on
commodity hardware.
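
The idea behind logging and replay can be shown with a toy model: if every non-deterministic input is recorded, re-applying the log to a fresh machine reproduces the same state, and the execution can be stopped at any earlier point for inspection. The `ToyVM` class and its update rule are hypothetical; real systems such as ReVirt log far richer events.

```python
class ToyVM:
    """Deterministic toy machine: its state depends only on the inputs consumed."""

    def __init__(self):
        self.state = 0

    def step(self, event):
        self.state = (self.state * 31 + event) % 1_000_003


def run_and_log(vm, events):
    """Run the VM live, recording each input before it is applied."""
    log = []
    for event in events:
        log.append(event)
        vm.step(event)
    return log


def replay(log, upto=None):
    """Rebuild the VM state from the log, optionally stopping after `upto` events."""
    vm = ToyVM()
    for event in log[:upto]:
        vm.step(event)
    return vm


live = ToyVM()
log = run_and_log(live, [7, 42, 3])
replayed = replay(log)
partial = replay(log, upto=2)   # state just before the third event
print(live.state, replayed.state, partial.state)
```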

Livewire [10] is a VMI-based intrusion detection system
(IDS). It makes use of two VMs running above a VMM: an
IDS VM and the monitored VM. Through VMI, the VMM
enables the IDS VM to inspect the state of the monitored
VM and monitor the interactions between its guest OS and
virtual hardware. Livewire accomplishes its intended
functionality through three components located in the IDS VM:
(1) the OS interface library, which interprets the hardware state
exported by the VMM in order to provide an OS-level view
of the monitored VM; (2) the policy engine, which consists of a
framework for building the policies needed for the IDS; and (3)
policy modules, which implement these policies.

IntroVirt [17] is an IDS. It leverages vulnerability-specific
predicates to detect whether adversaries exploited specific
security vulnerabilities before the relevant patches were
released, or in the period between the release of the patches and
their application. These predicates are written by the software
patch author(s) and executed at a particular invocation point
within the OS or application code. IntroVirt uses ReVirt [5] to
replay the execution of a monitored VM. During replay, IntroVirt
monitors the security predicates to find out whether any of the
announced vulnerabilities have already been exploited. It relies
on VMI to monitor predicate checks and inspect the state of
the monitored VM. When a breakpoint is encountered, the
VMM checkpoints the VM, invokes the predicate code, and
then rolls the VM back and resumes execution. Rolling
back the VM ensures that executing a predicate does not affect
the VM.
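
The checkpoint, invoke, roll back sequence can be illustrated with a toy state dictionary. This is a sketch under the assumption that the VM state can be copied in memory; the state layout and the example predicate are hypothetical and unrelated to IntroVirt's real predicate interface.

```python
import copy


def check_predicate(vm_state, predicate):
    """Run a predicate against the VM state without leaving side effects."""
    checkpoint = copy.deepcopy(vm_state)      # checkpoint the VM
    try:
        return predicate(vm_state)            # predicate may modify the state
    finally:
        vm_state.clear()                      # roll the VM back
        vm_state.update(checkpoint)


def overflow_predicate(state):
    """Hypothetical predicate: has the buffer been filled past its capacity?"""
    state["scratch"] = len(state["buffer"])   # side effect on the VM state
    return state["scratch"] > state["buffer_capacity"]


vm_state = {"buffer": "A" * 40, "buffer_capacity": 32}
exploited = check_predicate(vm_state, overflow_predicate)
print(exploited, "scratch" in vm_state)
```

The `finally` clause restores the checkpoint even if the predicate raises, so a faulty predicate cannot perturb the monitored state either.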

C. Digital Forensics 

VMI allows digital forensics investigators to carry out live 
VM analysis, where the dynamic state of a target VM can 
be obtained without allowing it to detect that it is being 
monitored. For instance, the virtual introspection for Xen suite 
of tools [13] contains a set of utilities built over an inspection 
library that can be used from a Xen administrative domain 
to examine a running VM. Such capability can be used to 
reveal relevant forensics data, or discrepancies between the 
guest OS and its view from the VMM perspective that could 
be caused by malware such as rootkits. 

IV. THE PROPOSED INFORMATION SECURITY 
APPROACH 

This section presents VAIL, a novel information security
approach that aims at thwarting information leakage attacks by
malware and insiders on sensitive files after decryption
in potentially compromised computer systems. It begins by
enumerating the security requirements that VAIL
should meet. It proceeds by describing the threat model,
evaluating the design alternatives, and justifying the choices
made. It then explains VAIL's structure, gives an overview of its
components, explains its encryption scheme, and, finally,
illustrates its operation.

A. Security Requirements 

VAIL should meet five security requirements: 

1) Provide an OS-independent trusted boundary. Remove
the OS from the users' trust base, whilst preserving
information confidentiality after file decryption in a
potentially compromised computer system.

2) Prevent circumvention. Continue to function as intended
in untrusted computer systems, and provide protection
against spoofing attacks.

3) Achieve 256-bit security strength. A brute-force attack
on a cryptographic key would require $2^{256}$ steps. A step
refers to performing a single encryption operation on
a given plaintext value with a key, then comparing the
result with a given ciphertext value [2].

4) Provide simplicity of use. Provide automatic and
transparent cryptographic operations to users, and limit their
interactions with the proposed approach to increase its
simplicity and usability.

5) Provide compatibility with existing OSs and hardware.
Provide direct applicability to commercial off-the-shelf
OSs and applications, without requiring any special or
additional hardware.
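
To give a feel for the third requirement, the following back-of-the-envelope calculation estimates how long an exhaustive search of $2^{256}$ steps would take. The attacker rate of $10^{18}$ steps per second is a purely hypothetical assumption, chosen to be generous to the attacker.

```python
STEPS = 2 ** 256                        # worst-case number of brute-force steps
RATE = 10 ** 18                         # assumed steps per second (hypothetical)
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

# Python integers are arbitrary precision, so this division is exact.
years = STEPS // (RATE * SECONDS_PER_YEAR)

print(f"2^256 has {len(str(STEPS))} decimal digits")
print(f"exhaustive search: about 10^{len(str(years)) - 1} years at the assumed rate")
```

Even under this optimistic rate the search time is on the order of $10^{51}$ years, which is why the text treats brute-forcing the key as infeasible.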

B. Threat Model and Assumptions 

It is assumed that OSs and applications are untrusted and
could be compromised. They will always contain exploitable
security vulnerabilities. OS-authenticated users are deemed
untrusted as well. They do not have full access to any
trusted components (explained in Subsection D). Malware
and insiders are the main adversaries. At runtime, they may
provide spoofed user interfaces (UIs) and attempt to spy on
users' authentication information. They may attempt to leak
information through: (1) breaching the isolation imposed
on the encrypted sensitive files while in storage; (2)
brute-forcing the encryption keys after compromising the VMM; or
(3) exploiting covert storage and timing channels.

C. Evaluation of Design Alternatives 

This subsection briefly evaluates VAIL's design alternatives; it
justifies the choice of: (1) virtualization type; (2) CPU
virtualization technique; (3) VMM design; (4) stored data
encryption approach; and (5) cryptographic algorithm and key
length.

- Choice of Virtualization Type

In order to function securely in a potentially compromised
computer system, isolation from adversaries is critical. Machine
virtualization contributes to achieving this requirement
through its isolation property. According to the threat model,
the OS is deemed untrusted; therefore
OS virtualization is discarded. Application virtualization is
discarded as well, since each sandboxed application runs above the
VMM, which runs as an application on top of the (untrusted)
OS. Consequently, VAIL adopts machine virtualization.

- Choice of CPU Virtualization Technique 

CPU virtualization techniques differ in how privileged and user
instructions are handled. They are classified into full
virtualization, paravirtualization, and hardware-assisted virtualization
[11]. Full virtualization virtualizes the host hardware to enable
concurrent execution of multiple heterogeneous unmodified
guest OSs using interpretation or dynamic binary translation;
this, however, degrades performance. Paravirtualization removes
the emulator's code, which yields a virtualized
system with a smaller footprint and reduced overhead.
However, it requires deep, error-prone manual modification
of guest OSs [14]. Pre-virtualization is an automated form
of paravirtualization in which unmodified guest OSs run
above an intermediary that, in turn, runs above the VMM
[20]. Recently, Intel processors [15] provided support for full
virtualization in hardware, thereby eliminating the need for
interpretation and dynamic binary translation. According to
the security requirements, VAIL should be directly applicable
to commercial off-the-shelf OSs and applications, without any
special or additional required hardware, which rules out
hardware-assisted virtualization. Consequently, since
pre-virtualization brings the benefits of both full virtualization
and paravirtualization, it is the chosen CPU virtualization
technique.

- Choice of VMM Design 

A microkernel is chosen over a monolithic VMM for the 
following reasons [30]: (1) it offers a minimal layer over 
the hardware platform by focusing on providing basic system 
services, and eliminating OS kernel nonessential modules 
(e.g., device drivers); (2) it is safer and more reliable, since 
most services run as user rather than kernel processes; and (3) 
it is easier to validate, maintain, extend, and port from one 
hardware design to another. Consequently, taking account of 
the preceding considerations, VAIL will adopt machine virtu- 
alization in which VMs running unmodified OSs run above 
a pre-virtualization layer installed on top of a microkernel 
that runs directly above the host hardware (VAIL structure 
is explained in Subsection D). 

- Choice of Stored Data Encryption Approach 

Encryption approaches to protect data at rest fall into three 
categories: (1) whole disk encryption; (2) file system-level 
encryption; and (3) file-level encryption as follows: 

• Whole Disk Encryption 

In whole disk encryption (WDE), such as Microsoft's 
BitLocker Drive Encryption [22], the cryptographic solution 
is placed at the hardware device itself, or in the appropriate 
device driver. Entire disks or partitions including all system 
and user files are encrypted using user-supplied passwords. 
WDE suffers from several drawbacks: (1) insiders could leak 
out information during unattended sessions, or through stolen 
decryption keys; (2) WDE is vulnerable to cold boot attacks, 
where cryptographic keys could be stolen from the dynamic 
random access memory by exploiting its ability to retain its 
contents for several seconds after machine shutdown [12]; and 
(3) the computer system incurs the overhead of encrypting 
and decrypting contents of entire disks including unclassified 
information. 

• File System-Level Encryption 

In file system-level encryption, cryptography is performed at 
the file system layer of the OS. However, its implementations 
suffer from inconsistency of encryption algorithms across 
successive file systems. For example, files encrypted by 
the encrypting file system (EFS) of Windows XP OS 
Service Pack 1 or later cannot be decrypted using EFSs 
incorporated in earlier Windows versions. This is because
recent EFSs employ the Advanced Encryption Standard 
(AES) cryptographic algorithm [8], whereas older EFSs 
support only the expanded Data Encryption Standard and the 
Triple Data Encryption Standard algorithms [7]. 

• File-Level Encryption 

In file-level encryption (FLE) chosen files are manually en- 
crypted and decrypted with a specified key (e.g., UNIX crypt 
utility). Despite its simplicity, it has several drawbacks [19]: (1) there
is no means to ensure users' commitment to delete plaintext 
versions of sensitive files after encryption, and re-encrypt them 
after each decryption; (2) after decryption, files are exposed 
to the risk of malicious manipulation; and (3) as the number 
of sensitive files increases, the overhead to manually decrypt 
and re-encrypt these files increases as well. 

Nevertheless, VAIL will adopt the FLE approach for the following
reasons: (1) FLE prevents adversaries from compromising
other still encrypted files in case a file's key was revealed or 
stolen; (2) VAIL intends to provide automatic and transparent 
cryptographic operations to its users, where sensitive files are 
never stored in plaintext, they are automatically encrypted at 
creation time, and forcibly re-encrypted at close time; and 
(3) an organization's sensitive files are naturally expected to 
be clearly identified according to its security policy, much 
fewer compared to total number of business files, and with 
low update frequency. 

- Choice of Cryptographic Algorithm and Key Length 

VAIL will adopt the AES symmetric-key block cipher for 
three main reasons: (1) it is an approved Federal Information 
Processing Standard (FIPS) cryptographic algorithm for the 
protection of sensitive information [8]; (2) it is more efficient 
than the other FIPS approved cryptographic algorithm (i.e., 
the Triple Data Encryption Algorithm) [2]; and (3) it is an 
industry-leading encryption algorithm used in several versions 
of Windows OS to encrypt entire OS drives [22]. A key
length of 256 bits is chosen because it is necessary to achieve
the required security strength.

D. VAIL Structure and Overview 

In order to provide the intended protection, some trusted 
components are necessary. Figure 2 illustrates VAIL structure 
in which trusted components are drawn in solid lines, whereas 
untrusted ones are drawn in dashed lines. VAIL operates 
using two VMs; the Vulnerable VM, and the Quarantined VM. 
Both VMs run concurrently above the VMM (a microkernel) 
through a pre-virtualization layer. The VMM resides directly 
above the host hardware. The Vulnerable VM runs the vul- 
nerable OS. It represents the user's personal computer; it may 
be compromised and is untrusted. The Quarantined VM is re- 
sponsible for storing sensitive files. It has no user applications 
installed on it, and is trusted. The VMM, the pre-virtualization 
layer, and the hardware are considered trustworthy. 
Fig. 2. VAIL structure

Through VAIL a computer system has two states of 
operation that are named after the VM currently in focus: (1) 
vulnerable state (i.e., the low-confidentiality state); and (2) 
quarantined state (i.e., the safe editing high-confidentiality 
state). While the computer system is in vulnerable state, the 
vulnerable OS has access to all devices and can communicate 
freely over the network. While the computer system is in 
quarantined state, the vulnerable OS is blocked from sending 
output through the network or to external storage devices. To 
provide safe access to sensitive files, VAIL adds three
more components to the above structure: VAIL Core, VAIL
Server, and VAIL Client, as follows:

1) VAIL Core. It resides in the pre-virtualization layer. It 
verifies predefined escape sequences that are needed to 
create sensitive files and manage transition between vul- 
nerable state and quarantined state. It saves a vulnerable 
state before transition to quarantined state, and restores 
it back after the user closes a sensitive file. 

2) VAIL Server. It runs inside the Quarantined VM. It creates,
encrypts, and stores sensitive files. It decrypts and
opens them after authenticating VAIL users by verifying
their supplied passwords through the VAIL Password-Based
Key Derivation and Verification (VPKDV) module. VPKDV
also derives keys from user-supplied passwords;
these keys are used to encrypt files' encryption keys.
VAIL Server stores the current vulnerable state before
decrypting and opening a sensitive file in quarantined
state, and re-encrypts sensitive files upon returning
to vulnerable state.

3) VAIL Client. It runs in the Vulnerable VM. It is 
responsible for interacting with the user, storing links 
to sensitive files, and salting user-supplied passwords. 
A comprehensive explanation of VAIL operation is 
provided in Subsection F. 
E. VAIL Encryption Scheme

VAIL employs AES-256 in cipher block chaining (CBC) 
mode of operation. The following subsections explain key 
generation, file encryption, and user authentication for file 
decryption processes. 

- Key Generation 

VAIL employs the AES-CBC for two purposes. Firstly, to 
encrypt sensitive files using 256-bit keys called VAIL File 
Cipher Keys (VFCKs). Secondly, to encrypt each VFCK, 
using a user-supplied password-based 256-bit key called 
VAIL Master Cipher Key (VMCK). VFCKs and VMCKs
are generated as follows:

1) Whenever a sensitive file i is created, VFCK_i is assigned
a 256-bit value generated by invoking a pseudo-random
number generator (PRNG) running in the Vulnerable VM.

$$\mathrm{VFCK}_i = \mathrm{PRNG}()$$

2) Whenever User x creates a VAIL account, or alters or uses
his/her password, the password is concatenated with a fresh unique
256-bit random bitstring (called the salt) that is generated
by invoking PRNG(). It is then iteratively hashed
through VPKDV to generate VMCK_x. VPKDV is a
module located in the VAIL Server that: (1) derives keys
from user-supplied passwords (Figures 3 and 4), which
are used to encrypt VFCKs; and (2) authenticates
users by verifying their supplied passwords (Figures 6
and 7).

$$\mathrm{VMCK}_x = \mathrm{CrtModPsw}(u, p, z)$$

As illustrated in Figure 3, User x creates a password P_x
of eight 7-bit ASCII characters (i.e., 56 bits). It is salted
with S_x, which is assigned a 256-bit output from PRNG().
The salted password is then iteratively hashed with the secure
hash algorithm SHA-256 [9] to compute a salted hash that is
assigned to VMCK_x. The user identifier (UID), the salt, and
the salted hash are then stored in the password file (PSF) that
resides in the Quarantined VM. The VPKDV functions that salt and
iteratively hash user-supplied passwords are defined in Figure
4.



function CrtModPsw(u, p, z)
    // Create or modify the password
    input: u    UID
    input: p    password
    input: z    number of iterations
    output: x   VMCK
    s <- PRNG()                     // salt from a PRNG function
    x <- SaltIterHashPsw(s, p, z)   // call password salting and iterative hashing function
    write(PSF, u, s, x)             // insert into the password file
    return x

function SaltIterHashPsw(s, p, z)
    input: s    salt
    input: p    password
    input: z    number of iterations
    output: x
    y_0 <- ε                        // begin with the empty string in the 1st iteration
    for a = 1, z do
        y_a <- SHA-256(y_{a-1} || s || p)   // iteratively salt and hash the password z times
    end do
    x <- y_z
    return x                        // salted and iteratively hashed password

Fig. 4. VPKDV functions to salt and iteratively hash user-supplied passwords 
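The two functions in Figure 4 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the password file (PSF) is modeled as an in-memory dictionary, the PRNG is assumed to be the operating system's cryptographic random source, and the function and parameter names are adapted from the figure.

```python
import hashlib
import secrets

SALT_BYTES = 32  # 256-bit salt, as stated in the text


def salt_iter_hash_psw(salt: bytes, password: bytes, iterations: int) -> bytes:
    """Return y_z, where y_0 is empty and y_a = SHA-256(y_{a-1} || s || p)."""
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    y = b""  # y_0: the empty string
    for _ in range(iterations):
        y = hashlib.sha256(y + salt + password).digest()
    return y  # 32 bytes, used as the VMCK


def crt_mod_psw(psf: dict, uid: str, password: bytes, iterations: int) -> bytes:
    """Create or modify a password: draw a fresh salt, derive the VMCK, store both."""
    salt = secrets.token_bytes(SALT_BYTES)  # assumption: OS CSPRNG stands in for PRNG()
    vmck = salt_iter_hash_psw(salt, password, iterations)
    psf[uid] = (salt, vmck)  # assumption: the PSF is a dict keyed by UID
    return vmck
```

Calling `crt_mod_psw(psf, "alice", b"passw0rd", 1000)` stores a (salt, VMCK) pair under the hypothetical UID `"alice"` and returns the 32-byte VMCK; the iteration count 1000 is an arbitrary illustrative value.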

- File Encryption 

A high-level view of VAIL file and key encryption process is 
as follows (Figure 5): 

1) The plaintext of file i (PF_i) is encrypted using AES-CBC
with the 256-bit VFCK_i as the key:

$$\mathrm{CF}_i = \text{AES-CBC}_{\mathrm{VFCK}_i}(\mathrm{PF}_i)$$

2) VFCK_i is encrypted using AES-CBC with the 256-bit
VMCK_x as the key. The 128 most significant (MS) bits
and the 128 least significant (LS) bits of the encrypted
VFCK_i are then inserted in FCF_a and FCF_b, respectively.
These two fields are located in the header of the
ciphertext of file i (CF_i):

$$\mathrm{write}(\mathrm{FCF\_a},\ \mathrm{Extract\_MS}(\text{AES-CBC}_{\mathrm{VMCK}_x}(\mathrm{VFCK}_i)))$$
$$\mathrm{write}(\mathrm{FCF\_b},\ \mathrm{Extract\_LS}(\text{AES-CBC}_{\mathrm{VMCK}_x}(\mathrm{VFCK}_i)))$$






Fig. 5. High-level view of VAIL file and key encryption process 
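The header split in step 2 can be illustrated with a short Python sketch. It assumes the encrypted VFCK is exactly 256 bits (two AES blocks, no padding block) and that bytes are ordered most significant first; both are assumptions made for illustration, and the AES-CBC encryption itself is left out, so `enc_vfck` is simply a 32-byte placeholder.

```python
from typing import Tuple

KEY_BYTES = 32    # encrypted VFCK: 256 bits (assumed: no padding block)
FIELD_BYTES = 16  # each header field holds 128 bits


def split_encrypted_vfck(enc_vfck: bytes) -> Tuple[bytes, bytes]:
    """Return (FCF_a, FCF_b): the 128 MS bits and the 128 LS bits."""
    if len(enc_vfck) != KEY_BYTES:
        raise ValueError("encrypted VFCK must be exactly 256 bits")
    # Assumption: big-endian byte order, so the leading bytes are the MS bits.
    return enc_vfck[:FIELD_BYTES], enc_vfck[FIELD_BYTES:]


def join_header_fields(fcf_a: bytes, fcf_b: bytes) -> bytes:
    """Reassemble the encrypted VFCK from the two header fields before decryption."""
    if len(fcf_a) != FIELD_BYTES or len(fcf_b) != FIELD_BYTES:
        raise ValueError("each header field must be exactly 128 bits")
    return fcf_a + fcf_b
```

Joining the two fields restores the original 32 bytes, which is what the decryption side would need before applying AES-CBC with VMCK_x.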



Fig. 3. The process of deriving VMCK using VPKDV 
- User Authentication for File Decryption



Files are decrypted only for users who successfully authenticate
themselves to VAIL through the VPKDV module. As depicted in
Figure 6, when User x requests to open a sensitive file, he/she
provides a unique UID and a password (P'_x). VPKDV uses
the UID to index into the PSF and retrieve the user's S_x.
It verifies P'_x by computing VMCK'_x: it concatenates P'_x
with S_x, iteratively hashes the result, and compares it with
the corresponding previously stored VMCK_x. If VMCK'_x and
VMCK_x are identical, then the file decryption process begins.
Otherwise, VPKDV terminates the decryption process. The
VPKDV function that authenticates users by verifying their
claimed passwords is defined in Figure 7.




Fig. 6. Authenticating a user 



function UsePsw(u', p', z)
    // On usage of a password
    input: u'   claimant UID
    input: p'   claimant password
    input: z    number of iterations
    a <- seek(PSF, "u == u'")        // use the UID to index into the password file
    (s, x) <- PSF[a]                 // retrieve the user's stored salt and VMCK
    x' <- SaltIterHashPsw(s, p', z)  // compute the claimant's VMCK
    if x == x' then
        begin the file decryption process
        return true
    else
        terminate
    end if

Fig. 7. VPKDV function to authenticate users by verifying their passwords 
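The verification in Figure 7 can be sketched in Python as follows. The password file is again modeled as a dictionary mapping a UID to a (salt, VMCK) pair, which is an assumption for illustration. The constant-time comparison via `hmac.compare_digest` is an addition that the paper does not mention, and returning `False` stands in for the figure's "terminate".

```python
import hashlib
import hmac


def salt_iter_hash_psw(salt: bytes, password: bytes, iterations: int) -> bytes:
    """y_0 is empty; y_a = SHA-256(y_{a-1} || s || p); return y_z."""
    y = b""
    for _ in range(iterations):
        y = hashlib.sha256(y + salt + password).digest()
    return y


def use_psw(psf: dict, uid: str, password: bytes, iterations: int) -> bool:
    """Return True only if the claimed password reproduces the stored VMCK."""
    record = psf.get(uid)  # use the UID to index into the password file
    if record is None:
        return False       # unknown UID: the decryption process never starts
    salt, stored_vmck = record
    claimed_vmck = salt_iter_hash_psw(salt, password, iterations)
    # Assumption: a constant-time comparison replaces the plain "x == x'" test.
    return hmac.compare_digest(stored_vmck, claimed_vmck)
```

A caller would begin the file decryption process only when `use_psw` returns `True`.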



F. VAIL Operation

This subsection details VAIL's operation to clarify the interac- 
tions between VAIL Core, VAIL Server, and VAIL Client to 
thwart malware and insiders' information leakage attacks on 
sensitive files after decryption. Some steps involve requesting 
the user to enter escape sequences as a security measure 
to prevent spoofing by the untrusted Vulnerable VM (see 
Subsection A in Section V). 

- Creating a Sensitive File



This process aims at ensuring that sensitive files are always
stored encrypted in the Quarantined VM. Upon receiving a 
user request from the Vulnerable VM to create a sensitive file, 
the following steps take place (Figure 8): 

1) VAIL Core disables the Vulnerable VM's network and
external storage device drivers (e.g., compact disk-rewritable),
and starts the Quarantined VM's Dynamic
Host Configuration Protocol (DHCP) client service.

2) VAIL Client captures the file name from the user. 

3) VAIL Client sends a request to VAIL Server to create a 
new sensitive file. 

4) VAIL Server checks whether there is sufficient free space on the
Quarantined VM; if so, it approves the
request.

5) VAIL Client requests the user to enter a predefined 
escape sequence (e.g., Alt + Del) that will be captured 
by VAIL Core. 



Fig. 8. VAIL process of creating a sensitive file



6) Upon receiving and verifying the escape sequence, VAIL
Core gives the UI focus to VAIL Server.

- If VAIL Core does not receive the escape sequence,
or receives an incorrect one from VAIL Client, then it
will not give the UI focus to VAIL Server. Instead, it
will display an alert noting that the process is unsafe,
terminate it, and undo the first four steps.

7) VAIL Server creates the file in the Quarantined VM, 
encrypts it, and requests the user to select a password 
that will be salted and iteratively hashed to encrypt the 
file's key (i.e., VFCK). 

8) VAIL Server sends to VAIL Client a link to the file,
which is its full path.

9) VAIL Server requests VAIL Core to return the UI focus 
back to VAIL Client. 

10) VAIL Core stops the DHCP client service on the Quarantined
VM, re-enables the Vulnerable VM's network and
external storage device drivers, and returns the UI focus
back to VAIL Client. 



- Opening a Sensitive File in Quarantined State 

This process aims at preventing adversaries from: (1) capturing
users' file decryption passwords; and (2) leaking out
sensitive information externally after opening sensitive files. At this
stage, one or more sensitive files reside in the Quarantined 
VM, and have links to them in VAIL Client. The transition 
process from vulnerable state to quarantined state comprises 
the following steps (Figure 9): 

1) VAIL Core disables the Vulnerable VM's network and
external storage device drivers, and starts the Quarantined VM's
DHCP client service.

2) VAIL Client sends VAIL Server a user request to open 
a sensitive file. The request includes the file's full path. 

3) Upon VAIL Server's approval, VAIL Client requests the 
user to enter a predefined escape sequence that will be 
captured by VAIL Core to perform the state transition. 



Fig. 9. VAIL process of opening a sensitive file

4) VAIL Core verifies the escape sequence and gives the
UI focus to VAIL Server on the Quarantined VM.
Requesting an escape sequence prevents the potentially
compromised Vulnerable VM from spoofing a transition
to capture the file's password.

- If VAIL Core does not receive the escape sequence,
or receives an incorrect one from VAIL Client, then it
will not give the UI focus to VAIL Server. Instead, it
will display an alert noting that the system is not in
quarantined state, terminate the process, and undo the
first step.

5) VAIL Core saves the current vulnerable state in VAIL 
Server. 

6) VAIL Server requests and obtains the file decryption 
password from the user. 

- If the user's supplied password is incorrect, then VAIL 
Server will terminate the process, and VAIL Core will 
return UI focus to VAIL Client, and undo the first step. 

7) VAIL Server notifies VAIL Client that the state transition 
is complete; it decrypts and opens the file. 

8) VAIL Core returns UI focus to VAIL Client. 
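The gating role of the escape sequence in steps 1 to 4 can be sketched as a small state machine in Python. This is only an illustration under stated assumptions: the class and attribute names are hypothetical, the escape sequence is encoded as a tuple of key names, and the password prompt, state saving, and alert display are omitted.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    VULNERABLE = "vulnerable"
    QUARANTINED = "quarantined"


# Hypothetical encoding of the predefined escape sequence; the paper's
# own example is "Alt + Del".
ESCAPE_SEQUENCE = ("alt", "del")


@dataclass
class VulnerableVM:
    network_enabled: bool = True
    external_storage_enabled: bool = True


class VailCore:
    def __init__(self, vm: VulnerableVM) -> None:
        self.vm = vm
        self.state = State.VULNERABLE
        self.ui_focus = "VAIL Client"

    def _set_output_drivers(self, enabled: bool) -> None:
        self.vm.network_enabled = enabled
        self.vm.external_storage_enabled = enabled

    def request_quarantined(self, keys: tuple) -> bool:
        """Disable output drivers first, then change state only on a valid escape sequence."""
        self._set_output_drivers(False)      # step 1
        if keys != ESCAPE_SEQUENCE:
            self._set_output_drivers(True)   # undo step 1; focus stays with VAIL Client
            return False
        self.state = State.QUARANTINED
        self.ui_focus = "VAIL Server"        # step 4
        return True
```

The point of the design is that only VAIL Core, which sits below the Vulnerable VM, decides whether the transition happens; a wrong or missing sequence leaves the system in vulnerable state with its drivers re-enabled.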

- Returning Back to Vulnerable State 

This process overwrites the vulnerable state after it reads 
confidential data in plaintext. After the user closes a sensitive 
file, vulnerable state is rolled back to the state that was 
saved before opening the file. Because sensitive files
are stored in the Quarantined VM, all changes that
were made to the Vulnerable VM, except those that were made
to the file, will be discarded. This security measure thwarts
adversaries' attempts to leak out sensitive information locally
by storing sensitive files in the Vulnerable VM after opening
them in quarantined state. 
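The effect of the rollback can be shown with a toy Python model. It assumes, purely for illustration, that the Vulnerable VM's entire state is a dictionary of files; a real snapshot would cover memory, disk, and device state, and the function names here are hypothetical.

```python
import copy


class VulnerableVM:
    """Toy stand-in for the Vulnerable VM: its whole state is a dict of files."""

    def __init__(self) -> None:
        self.files = {}


def save_state(vm: VulnerableVM) -> dict:
    """Capture the vulnerable state before entering quarantined state."""
    return copy.deepcopy(vm.files)


def restore_state(vm: VulnerableVM, snapshot: dict) -> None:
    """Overwrite the vulnerable state with the previously saved snapshot."""
    vm.files = copy.deepcopy(snapshot)


vm = VulnerableVM()
vm.files["notes.txt"] = b"public notes"
snapshot = save_state(vm)                                 # before opening the file
vm.files[".hidden/leak.bin"] = b"copied sensitive data"   # malicious local copy
restore_state(vm, snapshot)                               # on return to vulnerable state
```

After the restore, the hidden copy no longer exists in the Vulnerable VM, while the sensitive file itself is unaffected because it lives in the Quarantined VM.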

When the user enters an escape sequence, VAIL begins 
returning from quarantined state back to vulnerable state. This 
time the escape sequence is used to prevent the vulnerable OS 
from controlling the state transition process (see Subsection B 
in Section V). After the user enters the escape sequence, the 
following steps are taken (Figure 10): 

1) VAIL Server receives the escape sequence and passes 
it along to VAIL Core for verification, which in turn 
informs VAIL Client of the required transition. 

2) VAIL Server waits for a static time period after the 
Vulnerable VM flushes writes of the disk cache to the 
sensitive file on the Quarantined VM (see Subsection B 
in Section V). 

3) VAIL Server re-encrypts the sensitive file. 

4) VAIL Core restores the vulnerable state that was saved 
before entering quarantined state. Thus, all changes that 
were made in the Vulnerable VM while the system was 
in quarantined state will be overwritten. 

5) VAIL Core stops the DHCP client service on the Quaran- 
tined VM, and restarts the VMM's DHCP and Network 
Address Translation services. 

6) VAIL Core suspends and then reboots the Vulnerable
VM to reset states of its peripheral virtual devices (see
Subsection B in Section V).

7) VAIL Core resumes execution of the Vulnerable VM,
re-enables its network and external storage device drivers,
and returns UI focus to VAIL Client.

Fig. 10. VAIL process of returning back to a vulnerable state

V. SECURITY EVALUATION

This section evaluates VAIL's defenses against a variety of
information leakage attacks including: (1) direct information
leakage attacks on sensitive files launched from: (a) the
Vulnerable VM; and (b) a compromised VMM; and (2)
indirect information leakage attacks exploiting covert storage
and timing channels.

A. Direct Attacks

VAIL provides its users transparent access to sensitive files that
are forcibly and centrally stored encrypted in the Quarantined
VM, whereas users are given the illusion of accessing them
from the Vulnerable VM. However, if adversaries know
the files' locations, they may attempt to launch direct attacks
against them from either the Vulnerable VM or a compromised
VMM as follows.

- Attacks Launched from the Vulnerable VM

From the Vulnerable VM, adversaries may attempt to breach
the isolation imposed on encrypted sensitive files stored in
the Quarantined VM. As detailed in Subsection F in Section
IV, VAIL prevents such attacks in both operational states as
follows:

• While the computer system is in Vulnerable State

- VAIL Core, being trusted, isolated, and highly privileged,
stops the DHCP client service on the Quarantined VM to
prevent any communication between the Vulnerable VM and
the Quarantined VM.

- VAIL prevents adversaries from capturing file decryption
passwords by requesting the user to enter an escape sequence
to prevent spoofing by the untrusted Vulnerable VM. If
the Vulnerable VM becomes compromised, it would skip
requesting the escape sequence and display a spoofed UI
that resembles what the user is accustomed to. Such
a spoofing attack attempts to deceive the user into interacting
with a malicious process instead of with VAIL Client in order
to capture his/her file decryption password while the system
is still in vulnerable state.

• While the computer system is in Quarantined State

- VAIL prevents adversaries from externally leaking out
sensitive information from the Vulnerable VM after opening
sensitive files in quarantined state. This is accomplished by
VAIL Core through disabling the Vulnerable VM's network and
external storage device drivers.

- Attacks Launched from a Compromised VMM 

The VMM is an attractive target for adversaries. A com- 
promised VMM imposes a severe threat to the Quarantined 
VM and its workload as adversaries acquire VMM's most 
privileged access level. As a result of compromising the VMM, 
an adversary may attempt to: 

• Retrieve VFCK_i from the CF_i header

VAIL confronts such an attack with two countermeasures:

- VFCK_i Encryption. VFCK_i is encrypted using a strong
encryption algorithm (i.e., AES) with a long key (the 256-bit
VMCK_x). This operation strengthens data security by: (1)
binding every file only to its owner; and (2) preventing
illegitimate file access in case the file is copied to a machine
other than VAIL's Quarantined VM.

- Segregation of the Encrypted VFCK_i. VAIL splits the
encrypted VFCK_i into two bitstrings. The 128 MS bits and 128
LS bits are inserted into FCF_a and FCF_b, respectively.
These two fields are located in the header of CF_i.

• Obtain VFCK_i before and after encryption to brute-force
VMCK_x.

The objective of such an attack is to compromise all the
files that were created by User x. VAIL makes finding
VMCK_x almost impossible, since the adversary would have
to test up to 2^256 possible keys, which is
computationally infeasible. 
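To give a sense of scale for an exhaustive search over 2^256 keys, the following Python sketch estimates the time required. The guessing rate of 10^12 keys per second is a hypothetical figure chosen only for illustration, and the estimate assumes the key is drawn uniformly from the full 256-bit space.

```python
KEYSPACE = 2 ** 256                 # number of possible 256-bit keys
GUESSES_PER_SECOND = 10 ** 12       # hypothetical attacker rate (assumption)
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def years_to_exhaust(keyspace: int, guesses_per_second: int) -> float:
    """Worst-case time, in years, to try every key in the keyspace."""
    return keyspace / (guesses_per_second * SECONDS_PER_YEAR)


years = years_to_exhaust(KEYSPACE, GUESSES_PER_SECOND)
print(f"{years:.2e} years")  # on the order of 10^57 years
```

Even at this assumed rate, the worst case is on the order of 10^57 years, which is the sense in which the search is computationally infeasible.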

• Breach VMCK_x for a specific file.

Even if such an attack was successful, VAIL will prevent the 
adversary from compromising the user's other still encrypted 
sensitive files for two reasons: 

- Adoption of the FLE Approach. 
- VPKDV's salting and iterative hashing of user-supplied
passwords. This generates a strong unique per-file 256-bit key
(i.e., VMCK_x). In addition, it preserves passwords' uniqueness
even if a user chooses identical passwords for several sensitive
files, or if multiple users choose identical passwords.
Furthermore, it adds another layer of protection by hardening
passwords against dictionary-based attacks through increasing
their length. This increases the size of the search space, thereby
making password cracking computationally expensive.



B. Indirect Attacks Exploiting Covert Channels 

A covert channel is an indirect intra-machine channel that is
exploited by a malicious process to transfer sensitive
information in violation of the enforced security policy. Covert
channels are either storage channels or timing channels [4].
They can be viewed as the worst possible indirect information
leakage attack vector because of their non-conventional hidden means
of conveying sensitive information, and the huge volume of
information that could potentially be leaked, especially through
covert storage channels.

A covert storage channel involves a malicious process ma- 
nipulating a storage location to convey information indirectly 
to another storage location within a single machine. A covert 
timing channel allows one process to modify its usage of a 
shared system resource, such that, the resulting change in 
system response time is observed by a second process, thereby 
allowing it to infer extra information about sensitive data. 

- Covert Storage Channels

Through covert storage channels adversaries may attempt to
capture contents of sensitive files, and leak out device states
as follows:

• Capture contents of sensitive files

The vulnerable OS, being potentially compromised, could
capture contents of sensitive files after opening them in
quarantined state. It could then store them locally in the Vulnerable
VM in a hidden folder in order to leak them out after returning
to vulnerable state. VAIL eliminates such a covert storage
channel with two countermeasures:

- Leveraging the encapsulation property of machine
virtualization. Before opening a sensitive file in quarantined state,
VAIL Core saves the current vulnerable state in VAIL Server.
After the user closes the file, and upon his/her request, VAIL
begins returning from quarantined state back to vulnerable
state. VAIL Core rolls the vulnerable state back to the state
that was captured before opening the file. Because
sensitive files are stored in the Quarantined VM,
all changes that were made to the Vulnerable VM while
the system was in quarantined state, except those that were
made to the file, will be discarded (Figure 11). In addition,
overwriting the entire vulnerable state contributes extensively
to confronting a wide variety of previously unknown attacks.
Detailed steps were mentioned in Subsection F in Section IV.

Fig. 11. VAIL's approach to eliminate covert storage channels

- Leveraging the partitioning property of machine
virtualization. Since sensitive files are centrally stored encrypted in the
Quarantined VM, the VMM could statically partition the host's
storage space, such that the Vulnerable VM is allocated only
the minimal space required to install and run the guest OS and
the required user applications.

• Leak out device states

In a physical machine, two processes that have access to,
for instance, the keyboard could leak data regarding its states
(e.g., num- and caps-lock). Similarly, the Vulnerable VM could
store such information while the user is entering a decryption
password and/or an escape sequence to help in inferring them.
For this reason, upon returning to vulnerable state, VAIL
Core suspends and then reboots the Vulnerable VM to reset
states of its peripheral virtual devices (see Subsection F in
Section IV).



- Covert Timing Channel 

Through covert timing channels adversaries may attempt to: 

• Control the transition timing 

If the potentially compromised vulnerable OS was able to 
control the transition timing, then adversaries would be able 
to: (1) interrupt VAIL's functionality; and (2) infer information 
about a sensitive file after Vulnerable VM snapshot restoration. 
That is, after the Vulnerable VM flushes writes of the disk 
cache to a sensitive file on the Quarantined VM and after 
VAIL Core returns UI focus to the Vulnerable VM, adversaries 
could retrieve the system time. This would allow them to 
figure out the time period a user has spent editing a particular 
sensitive file, which would, in turn, allow them to infer its 
type (e.g., design document, or text document). To eliminate 
this covert channel, VAIL controls the state transition timing 
by two security measures (see Subsection F in Section IV). 

- Using a predefined keyboard escape sequence. Returning 
back to vulnerable state is performed only at the request of 
the user through a predefined keyboard escape sequence. 

- Adding extra delays. By adopting the approach presented in
[1], VAIL makes the timing behavior independent of
the sensitive data by adding extra delays. In each
state transition, VAIL Server waits for a static time period
after flushing writes of the disk cache to a sensitive file. This
makes it harder for an adversary to estimate the exact
time period that was spent viewing or editing a sensitive
file. 
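The static delay can be sketched in Python as follows. The delay value and the function names are hypothetical, since the paper does not state the length of the waiting period; the `sleep` parameter is injectable only so the behavior can be checked without actually waiting.

```python
import time
from typing import Callable

STATIC_DELAY_SECONDS = 2.0  # hypothetical value; the paper does not specify it


def flush_then_wait(
    flush: Callable[[], None],
    delay: float = STATIC_DELAY_SECONDS,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Flush pending writes to the sensitive file, then wait a fixed period.

    The wait does not depend on the file or on how long it was edited.
    """
    flush()
    sleep(delay)
```

In this sketch the same fixed period is added to every transition, regardless of which sensitive file was open.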



VI. CONCLUSION 

This paper presented a novel information security approach 
called Virtualized Anti-Information Leakage (VAIL). Its objec- 
tive was to thwart malicious software and insiders' information 
leakage attacks on sensitive files after decryption in potentially 
compromised computer systems. VAIL basically relied on 
leveraging machine virtualization's isolation, encapsulation, 
and partitioning properties to achieve its objective. 

By moving VAIL's security-critical operations to an
abstraction layer below that of the vulnerable operating system
(OS), it isolated its security functionality from adversaries'
circumvention, disabling, and subversion attacks. Through
leveraging machine virtualization's encapsulation property, VAIL
avoided being ad hoc. It acquired the capability to thwart
previously unknown attacks by overwriting the entire state of
the untrusted, potentially compromised virtual machine (VM)
representing the user's personal computer without affecting the
sensitive files. In addition, machine virtualization's partitioning
property contributed to confining covert storage channels.
VAIL benefited from the advantages of file-level encryption, and
overcame its drawbacks by providing its users automatic and
transparent cryptographic operations.

VAIL was designed not to rely on users' commitment to 
security. It provided its users transparent access to sensitive 
files that are forcibly and centrally stored encrypted in a 
dedicated and isolated VM other than the VM representing the 
user's personal computer. In addition, a file's key is encrypted 
and inserted into the header of the encrypted sensitive file. 
Thus, a user does not need to retain any information other 
than his/her password. Furthermore, salting and iteratively 
hashing a user-supplied password preserved its uniqueness and 
hardened it against dictionary-based attacks. 
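
As an illustration of salting and iteratively hashing a password, the following minimal sketch derives a 256-bit key with PBKDF2-HMAC-SHA256 from the Python standard library. The choice of PBKDF2, the 16-byte salt and the iteration count are assumptions made for this example; they are not taken from VAIL's implementation, and `derive_key` is a hypothetical helper name.

```python
import hashlib
import os


def derive_key(password: str, salt: bytes, iterations: int = 100_000) -> bytes:
    """Derive a 256-bit key from a password by salted, iterated hashing.

    PBKDF2-HMAC-SHA256 is used here only as a stand-in; the key derivation
    function actually used by VAIL is not specified in this section.
    """
    return hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations, dklen=32
    )


# A fresh random salt per user makes identical passwords yield different keys.
salt = os.urandom(16)
key = derive_key("example password", salt)
```

Because each guess costs `iterations` hash evaluations and the salt differs per user, precomputed dictionaries cannot be reused across users.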

At runtime, VAIL addresses spoofing attacks by requesting
users to enter predefined escape sequences in all its critical
operations. VAIL achieved 256-bit security strength; it made
brute-force attacks on a file's encryption key practically
impossible, since an adversary would have to test up to 2^256
possible keys, which is computationally infeasible. In addition,
VAIL is directly applicable to existing commercial off-the-shelf
OSs and applications, without any special or additional hardware.

VAIL's defenses were evaluated against a variety of information
leakage attacks, including: (1) direct attacks launched on sensitive
files from an untrusted VM and a compromised VMM; and (2) indirect
attacks exploiting covert storage and timing channels. Based on the
security evaluation, it was concluded that VAIL successfully
addressed information leakage attacks. It effectively complied with
the security requirements and met its objective. VAIL's potential
users would include custodians of sensitive information in business,
trade, financial, and industrial organizations.

Despite the aforementioned advantages, a limitation was 
identified. It arose as a result of the tireless efforts to pre- 
serve information confidentiality in a potentially compromised 
computer system. Opening a sensitive file terminates currently 
running network processes. That is, before opening a sensitive 
file, VAIL transits the computer system from its current low- 
confidentiality state to a safe editing high-confidentiality state. 
However, this requires disabling network device drivers of 
the VM that represents the user's personal computer. Con- 
sequently, network processes (e.g., downloading a file from 
the Internet) that may have been running before opening a 
sensitive file will be terminated. However, the authors believe
that this limitation would not affect VAIL's usability, since,
as previously mentioned, an organization's sensitive files are
naturally expected to be clearly identified according to its
security policy, far fewer than the total number of business
files, and infrequently updated.

Abstract - Quality of Service is a very important factor in
determining the quality of a VoIP call. Different subjective and
objective models exist for evaluating speech quality in VoIP. The
E-model is one of the objective methods of measuring speech
quality; it considers various factors such as packet loss, delay and
codec impairments. The calculations of the E-model are not very
accurate in the case of handovers, when a VoIP call moves from one
wireless LAN to another. This paper presents an experimental
evaluation of the performance of the E-model during handovers and
proposes a new approach to accurately calculate the speech quality
of VoIP during handovers, together with a MOS calculator that
implements it. A detailed description of the experimental setup and
a comparison of the new approach with the E-model are presented in
this work.

I. Introduction 

Voice over IP services use the traditional Internet Protocol
(IP) to send voice packets (1). They break the voice call into
small packets that are routed over the internet. Due to the
unreliable nature of the internet, these packets may get lost in
the network, which results in missing packets at the receiver
end. As a result, the receiver hears the speaker's sentence
incompletely and may not understand it. It is therefore essential
to monitor the quality of these voice calls to achieve
user satisfaction.

To measure the speech quality, various network factors such as
delay, packet loss and jitter are considered. The measured
speech quality is then mapped to a user satisfaction level.
Nowadays, many people make VoIP calls while they are
traveling, thus moving from one network to another. It is very
important that the user experiences good call quality when
the VoIP call is handed off from one network to another.
The process of handoff consists of temporarily disconnecting
from one network and then establishing a connection with the
new network. This can result in dropped calls or heavy
packet loss if not performed smoothly. Thus, it is very
important to measure the speech quality of VoIP during
handovers to achieve high user satisfaction.

II. MEASUREMENT OF SPEECH QUALITY: 

Speech quality is the measurement of user experience when a
VoIP call is established (2). The measurement of speech
quality is divided into two broad categories: objective
measurement and subjective measurement. Subjective tests
are user listening tests in which users are asked to rate the speech
quality. These tests are expensive to perform, and the accuracy
of the speech quality rating depends on the user's mood. To
measure the accuracy of these subjective tests, objective
methods are used. These are computational
methods that usually compare a good-quality signal to a
degraded signal (4).

A. MEAN OPINION SCORE (MOS) - SUBJECTIVE
LISTENING TEST:

Mean Opinion Score (MOS) is approved by the International
Telecommunication Union Telecommunication Standardization
Sector (ITU-T). It is a subjective listening test in which the
user rates the speech quality during the call.

MOS test ratings can be used to compare various codecs such
as iLBC and G.711. Although MOS tests are the most reliable
method of measuring speech quality, they are cumbersome
to perform. They are considered expensive and are
quite time consuming, so it is difficult to perform them
frequently.

B. PERCEPTUAL EVALUATION OF SPEECH QUALITY
(PESQ) - OBJECTIVE METHOD:

Perceptual Evaluation of Speech Quality (PESQ) is an ITU-T 
standard for objective measurement. It was introduced as 
MOS subjective tests were expensive to conduct and required 
a lot of time. PESQ test setup automatically maps the PESQ 
score to the subjective MOS score. 

It takes into account two signals; one is the reference signal 
while the other one is the actual degraded signal. Both the 
signals are sent through the test that uses the PESQ algorithm 
and the result is a PESQ score as shown in figure 1 below. 

Figure 1: PESQ Testing

A major drawback of the PESQ approach is that it does not take
into account impairments such as acoustic echo and
transmission delay. Also, this approach cannot be used to
monitor real-time calls or to compare codecs accurately (8).

C. E-MODEL - OBJECTIVE APPROACHES 

The E-model is a transmission planning tool that provides a
prediction of the expected voice quality (10), as perceived by a
typical telephone user, for a complete end-to-end (i.e. mouth-to-ear)
telephone connection under conversational conditions. The
E-model takes into account a wide range of telephony-band
impairments, in particular the impairment due to low-bit-rate
coding devices and one-way delay, as well as the classical
telephony impairments of loss, noise and echo. It is an
objective model proposed by ITU-T that addresses the
drawbacks of PESQ. It is a non-intrusive method of
predicting voice quality. The E-model takes into account
various factors that affect speech quality and calculates a
rating factor (R-factor) that ranges between 0 and 100. The
R-factor can also be converted into a MOS rating to give the
MOS score.

The R-factor is calculated as:

Robj = R0 - Is - Id - Ie + A (1)

Where:

R0: Signal-to-noise ratio (S/N) at the 0 dBr point

Is: Various speech impairments (e.g. quantization noise,
sidetone level)

Id: Impairments that occur due to delay (e.g. absolute
delay, echo)

Ie: Impairments caused by the equipment (e.g. codecs,
jitter, packet loss)

A: Advantage factor (A is 0 for wireline and A is 5 for
wireless)

Figure 3: R-factor Rating with MOS Score Mapping and User Satisfaction Level

Based on the ITU-T G.107 recommendation, the R-factor equation
can be simplified as:

R-factor = 93.2 - Id - Ie + A

Where:

A: The advantage factor; 0 for wireline and 5 for
wireless networks.

The value of Ie, which is the codec-dependent impairment, is
calculated as:

Ie = a + b ln(1 + cP/100)

Where:

P: The percentage packet loss; a, b and c are codec
fitting parameters.

Codec fitting parameters for iLBC (3) and G.711 (9) are
summarized in Table I.

Table I Fitting Parameters for Codec G.711 and iLBC

Parameters                        G.711    iLBC
a                                 0        10
b                                 30       19.8
c                                 15       29.7
Bitrate (kb/s)/frame size (ms)    64/20    15.2/20
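
The Ie formula with the fitting parameters of Table I can be sketched as follows. This is an illustrative helper, not the authors' tool; the names `CODEC_PARAMS` and `equipment_impairment` are assumptions introduced for the example.

```python
import math

# (a, b, c) fitting parameters taken from Table I.
CODEC_PARAMS = {
    "G.711": (0.0, 30.0, 15.0),
    "iLBC": (10.0, 19.8, 29.7),
}


def equipment_impairment(codec: str, loss_percent: float) -> float:
    """Ie = a + b * ln(1 + c * P / 100), with P the packet loss in percent."""
    a, b, c = CODEC_PARAMS[codec]
    return a + b * math.log(1 + c * loss_percent / 100)


# With no packet loss, Ie reduces to the codec constant a.
ie_g711 = equipment_impairment("G.711", 0.0)
ie_ilbc = equipment_impairment("iLBC", 0.0)
```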







Figure 2 shows the terms of R-factor equation. 



The value of Id, which is the impairment due to delay, is calculated
as:

Id = 0.024d + 0.11 (d - 177.3) H(d - 177.3) (3.5)

Where:

d: The total one-way delay (including serialization delay,
processing delay and propagation delay) in milliseconds.

H(x) is a step function defined as:

H(x) = 0 for x < 0 and H(x) = 1 otherwise.
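
A minimal sketch of the delay impairment and its step function is given below; `delay_impairment` is a hypothetical name used only for this illustration.

```python
def delay_impairment(d_ms: float) -> float:
    """Id = 0.024*d + 0.11*(d - 177.3)*H(d - 177.3), with d in milliseconds."""
    # H(x) is 0 for x < 0 and 1 otherwise, so the second term only
    # contributes once the one-way delay reaches 177.3 ms.
    step = 1.0 if d_ms - 177.3 >= 0 else 0.0
    return 0.024 * d_ms + 0.11 * (d_ms - 177.3) * step


short_delay = delay_impairment(45.0)   # below the 177.3 ms knee
long_delay = delay_impairment(200.0)   # above the knee, extra penalty applies
```

Below 177.3 ms the impairment grows linearly at 0.024 per millisecond; above it the slope increases to 0.134 per millisecond.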
The drawback of E-model is that the MOS scores 
calculated by E-model do not correlate very well with 
the subjective MOS scores. Also E-model does not 
calculate the packet loss and delay accurately during 
handovers, when a VoIP call moves from one network 
to another. 

III. IMPACT OF HANDOFF ON VOIP 

Various factors that affect the speech quality in VoIP have
already been discussed. This section examines how the speech
quality of VoIP is affected by a call handover.

A. HANDOFF IN VOIP 

Handoff is the process of transferring a call connection from
one base station to another base station in a different cell or
network. Handoff usually takes place when a user
moves around a geographical area. In a VoIP call established in
a wireless network, handover happens when a user moves from
one wireless network to another.

The handover stages can be summarized as:

Stage 1: The mobile station (MS) communicates with the
serving base station 1.

When the MS enters the overlapping region of two networks:

Stage 2: The MS is disconnected from the serving base
station for a while; in this stage there is no connection to
the network.

Stage 3: A new connection is established with the target base
station 2.

B. QUALITY OF SERVICE DURING HANDOVER

The quality of service of VoIP calls was calculated subjectively
using the MOS tests and objectively using the E-model (7).

1) SUBJECTIVE MOS TEST: When the MOS test was
conducted during handover, the listeners experienced
a gap (silence) for a while (during the handover
phase), and after the handover was complete
they could hear the test sentences. Some calls were
dropped because the mobile user could not connect to the
other wireless network. This happened due to the
excessive delay during the handover process: the mobile
user moved out of the handover region without being
authenticated to the new network, leading to a
dropped call.

2) E-MODEL CALCULATIONS: The MOS score was
also calculated objectively using the E-model. The
packets were captured using Ethereal during
handover. During handover the voice call
was temporarily disconnected, as users did not hear
anything for either G.711 or iLBC, but
Ethereal showed a 0%-0.02% packet loss for G.711
and a 0.3%-1.1% packet loss for the iLBC codec, whereas
the actual packet loss was much higher. The
drawback of E-model calculations based on the Wireshark
tool is therefore that they do not accurately capture the packet
loss during handover. Thus the E-model calculations
for the handover scenario show a very large difference
between the subjective and objective MOS scores.

C. NEW OBJECTIVE MODEL

The new objective model that I propose is based on studying
the Wireshark packets during handover. With reference to the
handover stages in figure, the handover delay can be
defined as follows:

The handover delay is the delay between the time of
disconnection from BS1 and the time of setting up the
connection with BS2.

Therefore, the handover delay can be calculated by measuring
the synchronization delay, the delay due to ranging information and
the registration delay.

Delay (HO) = d(sync) + d(rang) + d(reg) (1)

The packet loss during handover is measured by
counting the packets sent during the synchronization,
ranging and registration phases. The Wireshark screenshot in
figure shows packets being sent from 192.168.1.6
to 192.168.1.8 while the user cannot hear
anything, i.e. these are the packets lost during handover.

Therefore the handover packet loss will be:

P(h) = No. of packets sent during handover phase / Total
packets sent
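
The ratio above can be computed from two packet counts taken from a capture. The sketch below expresses the result as a percentage, since Ph is reported in percent later in the paper; the function name and the example counts are hypothetical.

```python
def handover_loss_percent(packets_during_handover: int, total_packets: int) -> float:
    """Handover packet loss Ph as a percentage of all packets sent."""
    if total_packets <= 0:
        raise ValueError("total_packets must be positive")
    return 100.0 * packets_during_handover / total_packets


# Hypothetical capture: 150 of 1000 packets were sent while the
# mobile station had no connection (sync, ranging, registration).
ph = handover_loss_percent(150, 1000)
```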

Thus, with the new approach, the enhanced E-model equation
becomes:

R = 93.2 - Idh - Ieh

Where:

Idh = 0.024*d + 0.11*(d - 177.3)*H(d - 177.3)

and d is the total one-way delay including the handover delay:

d = d(sync) + d(rang) + d(reg) + d(net) + packetization delay
+ processing delay

Therefore:

d = Delay (HO) + d(net) + packetization delay + processing delay (2)

From the E-model, Ie was defined as:

Ie = a + b ln(1 + cP/100)

Therefore Ie for handover measurement will be:

Ieh = a + b ln(1 + c(P + Ph)/100)

Where:

Ph: The packet loss during handover
IV. TEST SETUP 

The test setup consisted of a Windows machine and a mobile machine
with a VoIP call established between them. The VoIP client was
downloaded on both machines and was configured to use RTP ports
for sending and receiving voice packets. One laptop was fixed
while the user on the other laptop was mobile during handover.
The MOS subjective tests were performed for VoIP calls with
and without handovers using both codecs, G.711 and iLBC.
During each test, packets were captured using the Wireshark tool.

A. VOIP CLIENT: The VoIP client used in the project is
a freely available VoIP client.

B. MOS SCORE CALCULATOR: The MOS score
calculator that I developed is based on the new
proposed E-model for handovers.

It calculates the R-factor and the corresponding MOS
scores for speech samples during handovers using the
new approach. The main purpose of this tool is to
reduce the manual effort.

V. IMPLEMENTATION RESULTS 

In order to calculate the speech quality of VoIP during
handover, a MOS test was first conducted with 12 participants.
This subjective MOS test was performed with and without
handover. Ten sample Hindi test sentences were played and
participants rated each test sentence based on its quality.

A. SCENARIO WITHOUT HANDOVER

1) MOS SCORE (OBJECTIVE) FOR G.711

E-model calculations for G.711 without handover:

Delay delta = 20 ms

Total delay d = delta + packetization delay +
processing delay = 20 + 20 + 5 = 45 ms

Id = 0.024*d + 0.11(d - 177.3) H(d - 177.3) = 1.08

Ie is 0 for G.711

R = R0 - Id - Ie = 92.12

Therefore MOS = 4.383

2) MOS SCORE (OBJECTIVE) FOR iLBC

E-model calculations for iLBC without handover:

Delay delta = 30 ms

Total delay d = delta + packetization delay + processing
delay = 30 + 20 + 15 = 65 ms

Id = 0.024*d + 0.11(d - 177.3) H(d - 177.3) = 1.56

Ie is 10 for iLBC

R = R0 - Id - Ie = 81.64

Therefore MOS = 4.084
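
The two R-factor calculations above can be checked with a few lines of code. This is only a verification sketch; the function name is hypothetical, R0 is taken as 93.2 as in the simplified E-model, and packet loss is assumed to be zero so that Ie is the codec constant.

```python
def r_factor_without_loss(network_delay_ms: float, packetization_ms: float,
                          processing_ms: float, ie: float) -> float:
    """R = 93.2 - Id - Ie for a call without handover and without loss."""
    d = network_delay_ms + packetization_ms + processing_ms
    step = 1.0 if d - 177.3 >= 0 else 0.0
    i_d = 0.024 * d + 0.11 * (d - 177.3) * step
    return 93.2 - i_d - ie


r_g711 = r_factor_without_loss(20, 20, 5, ie=0.0)    # d = 45 ms
r_ilbc = r_factor_without_loss(30, 20, 15, ie=10.0)  # d = 65 ms
```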

B. SCENARIO WITH HANDOVER

1) SUBJECTIVE MOS SCORE

When the MOS test was conducted during handover, the listeners
experienced a gap (a silence) for a while (during the
handover phase), and after the handover was complete they could
hear the test sentences. The MOS test was performed with 12
participants.

Some calls were dropped because the mobile user could not connect
to the other wireless network. This happened due to the
excessive delay during the handover process: the mobile user
moved out of the handover region without being
authenticated to the new network, leading to a dropped call.

Codec    Average MOS Score
iLBC     3.143
G.711    3.311

Table: Average MOS scores for G.711 and iLBC during
handover

2) E-MODEL CALCULATION FOR G.711

Avg one-way delay = 20 ms

d = 20 + 5 + 20 = 45 ms

Therefore Id = 0.024*d = 1.08

Packet loss (P) for G.711 was 0.08%

Therefore, Ie = a + b ln(1 + cP/100)

= 0 + 30 ln(1 + 0.08*15/100) = 0.358

R-factor = 93.2 - 1.08 - 0.358 = 91.76

MOS (E-model) = 1 + 0.035*R + R(R - 60)(100 - R)*7*10^-6
MOS (E-model) = 4.38
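
The R-to-MOS conversion used in these calculations can be written as a small function. The cubic expression is the one given in the text; the clamping to 1 for R at or below 0 and to 4.5 for R at or above 100 follows the usual statement of the mapping in ITU-T G.107 and is added here as an assumption, since the paper only uses values inside that range.

```python
def r_to_mos(r: float) -> float:
    """Map an E-model R-factor to an estimated MOS."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1 + 0.035 * r + r * (r - 60) * (100 - r) * 7e-6


mos_g711 = r_to_mos(91.76)  # R-factor computed for G.711 in the text
```

Note that the cubic term is negative for R below 60, so the estimated MOS falls off faster than linearly for poor connections.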

3) E-MODEL CALCULATION FOR iLBC

Packet loss = 0.12%
Delay = 30 ms
Id = 1.56

Ie = a + b ln(1 + cP/100)

= 10 + 20 ln(1 + 0.12*30/100)

Ie = 10.707

R-factor = 93.2 - 1.56 - 10.707 = 80.93
Therefore, MOS (E-model) = 4.059

The results of MOS (subjective) and MOS (E-model) are
summarized in figure

VI. CALCULATIONS USING NEW APPROACH

A. NEW E-MODEL CALCULATION FOR iLBC

Avg one-way delay = 30 ms
d = 30 + 15 + 20 = 65 ms
Therefore Idh = 0.024*d = 1.56
Packet loss (P) for iLBC is 0.12%
Handover packet loss (Ph) = 15.09%
Therefore, Ieh = a + b ln(1 + c(P + Ph)/100)
= 10 + 20 ln(1 + 15.21*30/100) = 44.32
R-factor = 93.2 - 1.56 - 44.32 = 47.32

MOS (New E-model) = 1 + 0.035*R + R(R - 60)(100 - R)*7*10^-6

MOS (New E-model) = 2.45

B. NEW E-MODEL CALCULATION FOR G.711

Avg one-way delay = 20 ms
d = 20 + 5 + 20 = 45 ms
Therefore Idh = 0.024*d = 1.08
Packet loss (P) for G.711 was 0.08%
Handover packet loss (Ph) = 14.75%
Therefore, Ieh = a + b ln(1 + c(P + Ph)/100)
= 0 + 30 ln(1 + 14.83*15/100) = 35.12
R-factor = 93.2 - 35.12 - 1.08 = 57

MOS (New E-model) = 1 + 0.035*R + R(R - 60)(100 - R)*7*10^-6

MOS (New E-model) = 2.94
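
The two calculations of this section can be reproduced with the short sketch below. It uses the values that appear in the worked calculations (b = 20 and c = 30 for iLBC, b = 30 and c = 15 for G.711); the function name is hypothetical, and the result is expected to agree with the reported figures only up to rounding.

```python
import math


def new_model(d_ms: float, p_pct: float, ph_pct: float,
              a: float, b: float, c: float) -> tuple[float, float]:
    """Return (R, MOS) from the enhanced E-model with handover loss Ph."""
    step = 1.0 if d_ms - 177.3 >= 0 else 0.0
    idh = 0.024 * d_ms + 0.11 * (d_ms - 177.3) * step
    ieh = a + b * math.log(1 + c * (p_pct + ph_pct) / 100)
    r = 93.2 - idh - ieh
    # R-to-MOS mapping from the text, clamped outside 0..100 (assumption).
    if r <= 0:
        mos = 1.0
    elif r >= 100:
        mos = 4.5
    else:
        mos = 1 + 0.035 * r + r * (r - 60) * (100 - r) * 7e-6
    return r, mos


r_ilbc, mos_ilbc = new_model(65, 0.12, 15.09, a=10, b=20, c=30)
r_g711, mos_g711 = new_model(45, 0.08, 14.75, a=0, b=30, c=15)
```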



Comparison between G.711 and iLBC:

Without handover:

[Figures 5.1 and 5.2: delay versus time plots without handover; delay axis marked at 25 ms and 50 ms, time axis from 0 to 50 s]

We note that the delay when using iLBC is greater than
the delay with G.711, as shown in figure (5.1) and figure (5.2).

With handover:

Figure (5.3) and figure (5.4) show that, with handover, the delay
when using iLBC is much greater than the delay with G.711.

CONCLUSION 

The results of the MOS calculator show that the new
approach maps much more closely to the subjective MOS scores
than the E-model does, and helps to calculate the speech quality
during handoff much more accurately. The G.711 codec is a better speech
codec than the iLBC codec. The speech quality for G.711 is
extremely good without handoff. During a call handoff, its
speech quality does degrade, but not as much as that of iLBC. The
speech quality for iLBC is severely degraded during a call
handover, which leads to user dissatisfaction.

Abstract - As one of the most important tasks of web usage mining,
web user clustering, which establishes groups of users exhibiting
similar browsing patterns, provides useful knowledge for
personalized web services. There are many clustering algorithms.
In this paper, users' similarity is calculated and then a comparative
analysis of two clustering algorithms, namely the K-means algorithm
and the hierarchical algorithm, is performed. Web users are clustered
with these algorithms based on web user log data. Given a set of
web users and their associated historical web usage data, we study
their behavior characteristics and cluster them. In terms of
accuracy, K-means produces better results than the
hierarchical algorithm.

Keywords-clustering; K-means algorithm; hierarchical algorithm 

I. Introduction 

The World Wide Web has become increasingly important as 
a medium for commerce as well as for dissemination of 
information. In E-commerce, companies want to analyze the 
user's preferences to place advertisements, to decide their 
market strategy, and to provide customized guide to Web 
customers. In today's information based society, there is an 
urge for Web surfers to find the needed information from the 
overwhelming resources on the Internet. Web access log 
contains a lot of information that allows us to observe user's 
interest with the site. Properly exploited, this information can 
assist us to make improvements to the Web site, create a more 
effective Web site organization and to help users navigate 
through enormous Web documents. Therefore, data mining, 
which is referred to as knowledge discovery in database, has 
been naturally introduced to the World Wide Web. When 
applied to the World Wide Web, data mining is called Web 
mining. Web mining is categorized into three active research
areas according to which part of the web data is mined. One of
these is usage mining, also known as web-log mining, which studies
user access information from logged server data in order to
extract interesting usage patterns. In this context, cluster
analysis can be considered as one of the most important aspects 
in the Web mining process for discovering meaningful groups 
as well as interpreting and visualizing the key behaviors 
exhibited by the users in each cluster. The clustering problem is 
about partitioning a given data set into clusters such that the data 
points in the same cluster are more similar to each other than 
points in different clusters. 

In this paper, we explore the problem of user clustering and
then perform a comparative analysis of two clustering algorithms,
namely the K-means algorithm and the hierarchical algorithm. The
performance of these clustering algorithms is
compared in terms of accuracy.

The rest of the paper is organized as follows: in section 2,
related work is introduced. In section 3, the method and the
clustering analysis are explained. The experimental results
and discussion are presented in section 4. Finally, section 5
concludes the paper.

II. Related Work 

Several researchers have applied data mining techniques to 
web server logs, attempting to unlock the usage patterns of web 
users hidden in the log files. Data mining, which is referred to 
as knowledge discovery in database, has become an important 
research area as a consequence of the maturity of very large 
databases. It uses techniques from areas such as machine 
learning, statistics, neural networks, and genetic algorithms to 
extract implicit information from very large amounts of data. 
The goals of data mining are prediction, identification, 
classification, and optimization. The knowledge discovered by 
data mining includes association rules, sequential patterns, 
clusters, and classification. Garofalakis [1] gives a review of 
popular data mining techniques and the algorithms for 
discovering the Web. Reference [2] proposed a taxonomy of 
Web mining and identified further research issues in this field. 
Yu [3] examines new developments in data mining and its 
application to personalization in E-commerce. Reference [4] 
has demonstrated that web users can be clustered into 
meaningful groups, which help webmasters to better understand 
the users and therefore to provide more suitable, customized 
services. Mobasher, Cooley and Srivastava [5] propose a 
technique for capturing common user profiles based on 
association rule discovery and usage-based clustering. This 
technique directly computes overlapping clusters of URL 
references based on their co-occurrence patterns across user 
transactions. 

Nasraoui and Krishnapuram [6] use unsupervised robust 
multi-resolution clustering techniques to discover Web user 
groups. Xie and Phoha [7] use belief functions to cluster Web 
site users. They separate users into different groups and find a 
common access pattern for each group of users. Xu and Liu [8]
cluster web users with the K-means algorithm based on web user
log data; they introduced the concept of 'hits', which capture one
kind of user browsing information. The hits of all users who access
the Web pages of a Web site during a given period of time can be
extracted directly: hits(i, j) is the number of times user i accesses
Web page j during a defined period of time. The count of page visits is
the criterion used for clustering. Reference [9]
emphasized the need to discover similarities in users' accessing
behavior with respect to the time locality of their navigational
acts. In this context, they present two time-aware clustering
approaches for tuning and binding the page and time visiting
criteria. The two tracks of the proposed algorithms define
clusters of users that show similar visiting behavior in the
same time period, by varying the priority given to page or time
visiting.

III. Method and Clustering Analysis 



A. Calculate Users' Similarity 

The method begins with preprocessing of server logs, after which
users' sessions are extracted. The methods of comparing
similarity between users based on a criterion are then presented, and
user clustering is done by two algorithms, namely the K-means
algorithm and the hierarchical algorithm. The accuracy of these two
algorithms is analyzed.

Preprocessing of web server log files is conducted to identify
user sessions. Web servers usually register all the activities of
users in the form of web server logs. Because servers are configured
differently, there are several types of server logs.
Normally, however, server log files contain the same basic information,
such as client IP address, time of request, requested URL, the
HTTP status code, referrer and more. Several
preprocessing operations should be performed before applying
web usage mining techniques to the web server logs. In the scope
of our research, these operations include data cleansing
and identifying and separating users' sessions. Not all data in web
server logs are suitable for web usage mining, so a data cleansing
step is carried out to remove the improper data from the log file.

A user session is the set of pages seen by the user during a
particular visit to a website. Before applying web usage mining
techniques, web server logs should be grouped into meaningful
sessions. In this study, pages requested within a period of time
less than or equal to a certain threshold are considered one session.
Two appropriate features that represent a user's interests
are 'page view frequency' and 'time of viewing the page'. After
preprocessing of the web server log, the values of these features are
calculated, and cosine similarity is then used to calculate the
similarity between each pair of users.
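
One possible reading of the session rule is sketched below: a user's requests are sorted by time, and a new session starts whenever the gap to the previous request exceeds a threshold. The gap-based interpretation, the 30-minute default and the function name are assumptions made for this example; the paper does not state the threshold it used.

```python
from datetime import datetime, timedelta


def split_sessions(requests, max_gap=timedelta(minutes=30)):
    """Group one user's (timestamp, page) requests into sessions.

    A request joins the current session when it arrives no later than
    max_gap after the previous request; otherwise it opens a new session.
    """
    sessions = []
    for ts, page in sorted(requests):
        if sessions and ts - sessions[-1][-1][0] <= max_gap:
            sessions[-1].append((ts, page))
        else:
            sessions.append([(ts, page)])
    return sessions


# Hypothetical requests from a single user.
t0 = datetime(2014, 6, 1, 9, 0)
example = [
    (t0, "/home"),
    (t0 + timedelta(minutes=10), "/products"),
    (t0 + timedelta(minutes=50), "/contact"),  # 40 minutes later: new session
]
example_sessions = split_sessions(example)
```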

The similarity between two users can be measured by
counting the number of times they access common pages.
We use the cosine similarity as the similarity measure. In this
case, the measure is defined by

$$S_f(u_i, u_j) = \frac{\sum_k \mathrm{freq}(u_i, p_k) \cdot \mathrm{freq}(u_j, p_k)}{\sqrt{\sum_k \mathrm{freq}(u_i, p_k)^2 \cdot \sum_k \mathrm{freq}(u_j, p_k)^2}} \quad (1)$$

where $S_f(u_i, u_j)$ is the similarity between user $u_i$ and user $u_j$
based on page view frequency, and $\mathrm{freq}(u_i, p_k)$ is the number
of times user $u_i$ accessed page $p_k$.

In like manner, the similarity between two users can be
measured more precisely by taking into account the actual time
the users spent viewing each web page. We again use the cosine
similarity as the similarity measure. In this case, the measure is
defined by

$$S_t(u_i, u_j) = \frac{\sum_k t(u_i, p_k) \cdot t(u_j, p_k)}{\sqrt{\sum_k t(u_i, p_k)^2 \cdot \sum_k t(u_j, p_k)^2}} \quad (2)$$

where $S_t(u_i, u_j)$ is the similarity between user $u_i$ and user
$u_j$ based on the time spent viewing pages, and $t(u_i, p_k)$
is the amount of time user $u_i$ spent viewing page $p_k$.
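
Both measures are the same cosine formula applied to different per-page vectors (view counts or viewing times). A minimal sketch follows; returning 0 when a user has an all-zero vector is a convention chosen for this example, not something stated in the paper.

```python
import math


def cosine_similarity(x, y):
    """Cosine similarity between two users' per-page feature vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    # A user with no recorded activity has zero norm; treat as dissimilar.
    return dot / norm if norm else 0.0


# Hypothetical page view counts for two users over three pages.
user_a = [1, 2, 3]
user_b = [2, 4, 6]
similarity = cosine_similarity(user_a, user_b)
```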

As a result, applying equations (1) and (2) yields two users-pages
matrices. For clustering we need a single matrix.
Therefore, equation (3) is defined as

$$w(u_i, p_j) = a \cdot \mathrm{freq}(u_i, p_j) + b \cdot t(u_i, p_j) \quad (3)$$

where $w(u_i, p_j)$ is the weight given to page $p_j$ based on the
features of user $u_i$, and a and b are empirical values chosen for
the site. The similarity between users can then also be obtained
from this weighting criterion by using cosine similarity.
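
Equation (3) combines the two users-pages matrices element by element. The sketch below uses a = 0.7 and b = 0.3, the values reported in the experimental section; the function name and the sample matrices are hypothetical, and no rescaling of the two features is applied because the paper does not describe one.

```python
def combine_features(freq, time_spent, a=0.7, b=0.3):
    """Build the weighted users-pages matrix w = a*freq + b*t."""
    return [
        [a * f + b * t for f, t in zip(f_row, t_row)]
        for f_row, t_row in zip(freq, time_spent)
    ]


# Rows are users, columns are pages (hypothetical values).
freq = [[2, 0], [1, 3]]
time_spent = [[10, 0], [5, 20]]
weights = combine_features(freq, time_spent)
```

Each row of the resulting matrix is the feature vector of one user and can be passed to a cosine similarity or a clustering routine.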

B. Clustering Algorithms 
1 ) K-Means Clustering 

In this stage K-means clustering algorithm is performed. The 
flow of algorithm is shown as the following steps: 

1. Place K points into the space represented by the 
objects that are being clustered. These points represent 
initial group centroids. 

2. Assign each object to the group that has the closest 
centroid. 

3. When all objects have been assigned, recalculate the 
positions of the K centroids. 

4. Repeat Steps 2 and 3 until the centroids no longer 
move. This produces a separation of the objects into 
groups from which the metric to be minimized can be 
calculated. 

At the end of the K-Means clustering stage, K clusters are
obtained in which the users in each cluster have patterns that are
most similar to each other with respect to individual preferences.
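
The four steps above can be sketched in plain Python as follows. This is an illustrative implementation, not the RapidMiner operator used in the experiments; choosing K data points at random as initial centroids, the squared Euclidean distance and the names used are assumptions of this example.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, max_iter=100, seed=0):
    """Cluster points into k groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Step 1: pick k data points as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to its closest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop once assignments (and hence centroids) are stable.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels, centroids = k_means(points, k=2)
```

The objective being minimized is the sum of squared distances between each point and its assigned centroid.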

C. Hierarchical Clustering [10] 

The hierarchical clustering method merges or splits similar
data objects by constructing a hierarchy of clusters, also known as
a dendrogram. It forms clusters progressively. Hierarchical
clustering is classified into two forms: agglomerative and
divisive algorithms.

• Agglomerative hierarchical clustering is a bottom-up
method that starts with every single object in its own
cluster. Then, in each successive iteration, it
combines the closest pair of clusters according to
some similarity criterion, until all of the data is in one
cluster or a stopping condition specified by the user is met.
• Divisive hierarchical clustering is a top-down
approach. It starts with one cluster that contains all data
objects. Then, in each successive iteration, it divides
clusters according to some similarity criterion until each data
object forms a cluster of its own or a stopping
criterion is satisfied.
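
The agglomerative variant can be sketched as follows. The paper does not state which linkage was used, so single linkage (distance between the closest members of two clusters) and Euclidean distance are assumptions of this example, as are the function and variable names.

```python
import math


def agglomerative(points, n_clusters):
    """Bottom-up clustering; returns clusters as lists of point indices."""
    # Start with every object in its own cluster.
    clusters = [[i] for i in range(len(points))]

    def linkage(c1, c2):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in c1 for j in c2)

    while len(clusters) > n_clusters:
        # Find and merge the closest pair of clusters.
        a, b = min(
            ((x, y) for x in range(len(clusters))
             for y in range(x + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


sample = [(0, 0), (0, 1), (10, 10), (10, 11), (50, 50)]
result = agglomerative(sample, n_clusters=3)
```

Stopping at a requested number of clusters corresponds to the user-specified stopping condition mentioned above; running until one cluster remains would trace the full dendrogram.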

IV. EXPERIMENTAL EVALIATION 

The real data set were used for this study which related to 
Palood dairy 1 products company. The result of this report is for 
two months (June, July 20 14). The size of access log for two 
months is 56 MB. The information contained in the user access 
log of this site includes more details of users' requests so 
unnecessary information has been refined. The information 
stored for this study include: requesting IP address, date and 
time of the request, Parameters sent in every web address, 
communication method, user's browser type and operating 
system and page sizes. After collecting information data 
preprocessing was done. To implement the components of the 
proposed approach, the SQL Server 2008 database, Visual 
Studio 2010 software and RapidMiner was used. RapidMiner is 
a software that provides an integrated environment for machine 
learning, data mining, text mining, predictive analytics and 
business analytics. 

After refining the raw data the number of distinct users for 
two mount who visited the site was gained. The number of 
distinct users was 1240 and the number of valid web pages was 
160. 

The value of equation (3) were calculated and the similarity 
between users was gained based on weighting criterion by use 
of cosine similarity and then was given to RapidMiner software 
in form of Users-Pages matrix. The values of a and b in equation 
(3) set to 0.7 and 0.3 respectively. 

Having introduced the two clustering algorithms, we now turn to a comparison of these algorithms on the basis of a practical study. The experimental results of the two algorithms are presented using the real data set. The K-Means algorithm was applied to cluster web users with different k values. The best precision was obtained when the number of clusters (K) was set to 3; Figure 1 illustrates this result. The experimental results of both algorithms are presented in Table 1. As the results show, the K-Means algorithm has better accuracy than the hierarchical algorithm. Likewise, the K-Means algorithm takes less time.

Figure 1. K-means accuracy with different k values



TABLE 1. Results of clustering

Algorithm      Number of clusters   Time     Accuracy
Hierarchical   3                    2.53 s   61.12%
K-Means        3                    1.45 s   64.33%



V. CONCLUSION 

In this paper, users' similarity was calculated, and then a comparative study was performed on two clustering algorithms, namely the K-means algorithm and the hierarchical clustering algorithm. The comparison was performed on a real data set. Web users were clustered by these algorithms based on web user log data. The K-means algorithm gave the better result with respect to both accuracy and time.

Abstract - Scheduling is one of the most important concepts in Operating Systems. One of the most popular algorithms is Round-Robin, which switches processes after each has run for the set Time Quantum (TQ). The TQ value affects the average Waiting and Turnaround times and the number of Context Switches (CS). The TQ can be defined statically, so that it does not change, or dynamically, recalculated cycle after cycle. This review builds on the study of new techniques for determining the TQ dynamically. First we show that in all cases the dynamic method is efficient, and then we rank the most important techniques used. We look at how each works and at their differences and similarities. We observe their efficiency on different parameters and the conditions in which they are effective. Finally we show that MDTQRR is the most effective at minimizing the number of CS and that HARM is the most effective in average (Waiting and Turnaround) time.

Key words - Round-Robin, Quantum Time, Waiting Time, Turnaround Time, Context Switch, ready queue.

I. INTRODUCTION 

A process is an instance of a computer program that is executing. It includes the current values of counters, registers and variables. The difference between a program and a process is that the program is a set of instructions while the process is an activity. Processes that are waiting to be executed by the processor are stored in a queue called the Ready Queue. The time during which the process keeps the CPU is known as the Burst Time. Arrival time is the time at which the process reaches the ready queue. Waiting time is the length of time the process stays in the ready queue. The number of context switches is the number of times the CPU switches from one process to another. Turnaround time is the time from the arrival of the process in the ready queue until its completion. The best algorithm would have minimum waiting time, minimum turnaround time and the smallest number of context switches.

II. SCHEDULING ALGORITHMS

First Come First Served (FCFS): 

In this algorithm, the first process that makes a request is selected. Although very simple, it has major shortcomings in performance compared to the other algorithms. With FCFS many short processes wait too long. If a long process takes its time, the others must wait until it has finished. This is called the convoy effect.

Shortest Job First (SJF):

In this algorithm short processes have priority. If two processes have the same duration, the one that arrived first is taken, so the algorithm reduces to FCFS [14].

Shortest Remaining Time First (SRTF):

This algorithm resembles SJF but with some modifications. When the burst times of the processes are compared, the remaining time to finish of the current process is taken into account.

Priority Scheduling Algorithm:

It assigns a priority to each of the processes, and the process with the highest priority is selected. This method, however, can leave a process with lower priority waiting forever: if the system is heavily loaded, that process risks starvation. One solution is to increase the priority of processes that have been waiting for a very long time [9].

Round Robin Scheduling Algorithm:

Round Robin is one of the oldest, simplest, fairest and most widespread algorithms. Each of the processes has the same priority, k, and is given some time, the Time Quantum (TQ); after this period has passed, the process is switched. Two situations can occur: first, if the given TQ is longer than the burst time, the process leaves the processor by itself; second, if the TQ is less than the burst time, the process is switched out and placed at the end of the ready queue [13].
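
A minimal sketch of Round Robin with a static TQ is given below. It assumes that all processes arrive at time 0, that there is at least one process with a positive burst time, and that a context switch is counted only when the CPU passes to a different process; the function name `round_robin` is illustrative and is not taken from the cited works.

```python
from collections import deque

def round_robin(burst_times, tq):
    """Simulate static-TQ Round Robin for processes that all arrive at time 0.

    Returns (average waiting time, average turnaround time, context switches).
    """
    remaining = list(burst_times)
    queue = deque(range(len(burst_times)))
    finish = [0] * len(burst_times)
    clock = 0
    switches = 0
    last_pid = None
    while queue:
        pid = queue.popleft()
        if last_pid is not None and pid != last_pid:
            switches += 1            # the CPU moves to a different process
        last_pid = pid
        run = min(tq, remaining[pid])
        clock += run
        remaining[pid] -= run
        if remaining[pid] > 0:
            queue.append(pid)        # quantum expired: back to the end of the queue
        else:
            finish[pid] = clock      # burst finished: the process leaves the CPU
    n = len(burst_times)
    turnaround = finish              # arrival time is 0, so turnaround = finish time
    waiting = [t - b for t, b in zip(turnaround, burst_times)]
    return sum(waiting) / n, sum(turnaround) / n, switches

avg_wait, avg_turnaround, cs = round_robin([24, 3, 3], tq=4)
```

For burst times 24, 3 and 3 with TQ = 4, the two short processes finish within their first quantum and the long process is switched out once and then runs alone.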

Lottery Scheduling Algorithm:

This algorithm gives each process a ticket and then randomly draws a ticket number; the process holding that number takes the next turn.

III. STATIC AND DYNAMIC ROUND-ROBIN 

Round-Robin has some advantages over the other algorithms. It is simple, it does not involve interrupts or data sharing, it has no traffic, and it is well suited to systems that have only sequential actions.

Its major challenge, however, is determining the TQ. This is the parameter that affects the mean Waiting and Turnaround time in the execution order [5]. One of the main subjects of study in this field is the comparison of the static time quantum, which is pre-determined and never changes in any cycle, with the dynamic TQ, which is redefined in every cycle from updated inputs. These inputs depend on the technique used to determine the TQ, but most techniques take as their main inputs the burst times and the number of processes in the ready queue. A static TQ causes a small number of context switches if the TQ is high and a large number of context switches if the TQ is small. A higher number of context switches means higher mean waiting and turnaround times, and all this leads to an overhead that lowers the performance of the system [7]. So the main goal of the dynamic form is to determine the right TQ value. The different formulas are set forth below.

IV. NEW EMBEDDED TECHNIQUES IN THE ROUND-ROBIN 
SCHEDULING ALGORITHM 

In this field of study, techniques embedded in the round-robin algorithm, there has been much research and effort to improve it further over the last four years. Many surveys of round-robin have been written, but they present no new techniques. Below we present the main techniques that have contributed to the field and can still be improved and developed further. With each technique's description we note the similar techniques for defining the TQ and the technique's publication time. It should be emphasized that many of them are improvements of one another.

a. Shortest Remaining Burst Round-Robin (SRBRR)

SRBRR improves on simple RR by giving the processor to the processes with the shortest remaining burst, in round-robin fashion, using a dynamic TQ. It performs better than RR with respect to Waiting and Turnaround time and the number of Context Switches. The TQ is redefined every time a new process enters the ready queue, and is defined as the median of the remaining bursts of the processes in the ready queue [2].

TQ=Median(remaining burst time of all processes) 

It should be emphasized that the median is calculated over a queue sorted in ascending order. This technique has served as the starting point for many other researchers, and it is precisely the median concept from which those studies started.
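
The SRBRR quantum can be sketched as below, assuming the remaining burst times of the ready processes are held in a plain list; the function name is hypothetical. Python's `statistics.median` sorts its input internally, which matches the ascending-order requirement.

```python
from statistics import median

def srbrr_time_quantum(remaining_bursts):
    """TQ = median of the remaining burst times of the ready processes."""
    return median(remaining_bursts)

tq_odd = srbrr_time_quantum([10, 30, 20, 50, 40])   # middle value of the sorted list
tq_even = srbrr_time_quantum([10, 20, 30, 40])      # mean of the two middle values
```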

b. Improved Shortest Remaining Burst Round-Robin (ISRBRR)

A further improvement of the above technique is an algorithm named Improved-SRBRR. The difference between the two techniques is precisely the definition of TQ, which is the focus of all the techniques included in this study. The algorithm operates in the same manner: when a new process arrives, variables such as the burst times and the number of processes are updated. The difference lies in the TQ definition formula [12].

TQ=Ceil(sqrt(median*highest burst time)) 
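
As a small sketch of this formula, assuming the median and the highest burst time are both taken over the remaining burst times in the ready queue (the function name is illustrative):

```python
import math
from statistics import median

def isrbrr_time_quantum(remaining_bursts):
    """TQ = ceil(sqrt(median * highest burst time))."""
    med = median(remaining_bursts)
    highest = max(remaining_bursts)
    return math.ceil(math.sqrt(med * highest))

tq = isrbrr_time_quantum([10, 20, 30, 40, 50])  # median 30, highest 50
```

With these illustrative values the square root of 1500 is about 38.7, so the quantum is rounded up to 39, which lies between the median and the highest burst time.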

By looking at the formula we see that TQ is calculated from the median and the highest burst time, taking the ceiling of the square root of their product. Compared with the other techniques studied here, ISRBRR shows a significant improvement over its simpler version in all parameters: the number of context switches and the average Waiting and Turnaround times. This improvement is observed not only for a ready queue sorted in ascending order, but also for descending and random order. Its block diagram is shown in Img 1.

d. Average Max Round Robin (AMRR)

Another attempt to improve the number of Context Switches, specifically to reduce them, is the AMRR technique. These techniques are very similar to each other in organization and functioning, in that they are all based on a dynamic time quantum. They differ mainly in how the TQ is defined and in which of the three parameters mentioned before they aim to improve. Another difference is whether or not their effects depend on the order of the ready queue. For AMRR, whether or not the queue is sorted has no impact, because the improvement lies in the reduction of Context Switches. Two formulas are used: first the average burst time is calculated, and then the TQ is taken as the average of this value and the highest Burst Time [1]:

AVG=SUM(Burst Time of all processes)/Number of processes 
TQ=(AVG+MAX(BT))/2 
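
The two AMRR formulas translate directly into a short sketch; the function name and the sample burst times are illustrative only.

```python
def amrr_time_quantum(burst_times):
    """TQ = (AVG + MAX(BT)) / 2, where AVG is the mean burst time."""
    avg = sum(burst_times) / len(burst_times)
    return (avg + max(burst_times)) / 2

tq = amrr_time_quantum([10, 20, 30, 40, 50])  # AVG = 30, MAX(BT) = 50
```

Because neither the sum nor the maximum depends on order, the result is the same for a sorted and an unsorted ready queue.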



e. Self-Adjustment Round-Robin (SARR)

A very important role in the development of a better dynamic selection of TQ was undoubtedly played by the SARR publication. It laid the foundation for a series of new research and development that followed. The author's background in mathematics played an important role: he devised two important formulas that were later combined to function effectively.



Img 1. ISRBRR block diagram [12]

c. Average Mid Max Round Robin (AMMRR)

Another technique is AMMRR, which calculates the TQ differently from the above techniques. Here the time given to the process is calculated in two steps, taking into account the minimum and maximum values of the burst time [1].

Mid=(Min+Max)/2
TQ=(Mid+Max)/2

So the formula has another form: 

TQ=(Min+3*Max)/4 
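
As a check on this simplification, the two-step definition can be expanded term by term, using only the Min, Max and Mid symbols already defined:

\[
TQ = \frac{\mathrm{Mid} + \mathrm{Max}}{2}
   = \frac{\frac{\mathrm{Min} + \mathrm{Max}}{2} + \mathrm{Max}}{2}
   = \frac{\mathrm{Min} + 3\,\mathrm{Max}}{4}
\]

For example, with Min = 10 and Max = 50 (illustrative values, not taken from the cited work), Mid = 30 and TQ = 40 under either form.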



In this technique too, the ready queue is first sorted and then the TQ is determined as in the techniques above. Its value is determined in two steps: first the average of the burst time extremes is calculated, and then the average of that value and the maximum burst time in the ready queue. The main purpose of this method is to reduce the number of context switches, this time at the expense of average Waiting and Turnaround time. If the last two parameters are the priority, it is more effective to use ISRBRR.

f. Dynamic Quantum with Readjusted Round Robin (DQRRR)

A further development of the above technique is the modified algorithm DQRRR. Its largest impact was in lowering the number of context switches [4]. This research was in turn developed further by two other methods: MMRR and MDTQRR.



Img 2. SARR block diagram [3]

This algorithm is again based on the median concept, but it modifies how the median is chosen, distinguishing between an odd and an even number of processes. The formula is as follows [3]:



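\[
TQ =
\begin{cases}
Y_{(N+1)/2} & \text{if } N \text{ is odd} \\
\frac{1}{2}\left[ Y_{N/2} + Y_{1+N/2} \right] & \text{if } N \text{ is even}
\end{cases}
\]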



where Y is the value at the given position among the numbers listed in ascending order. The above formula represents the modification made to the above-mentioned technique [3]. It is more effective when the number of data is large.
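
The odd and even cases of the formula can be sketched as follows. The formula uses 1-based positions, so the sketch shifts each index by one for Python's 0-based lists; the function name `positional_median` is hypothetical.

```python
def positional_median(burst_times):
    """Median by position: Y[(N+1)/2] for odd N, mean of Y[N/2] and Y[1+N/2] for even N."""
    y = sorted(burst_times)          # Y: burst times in ascending order
    n = len(y)
    if n % 2 == 1:
        return y[(n + 1) // 2 - 1]   # position (N+1)/2, shifted to 0-based
    return (y[n // 2 - 1] + y[n // 2]) / 2   # positions N/2 and 1+N/2, shifted to 0-based

tq_odd = positional_median([5, 3, 9])       # sorted: 3, 5, 9
tq_even = positional_median([4, 8, 2, 6])   # sorted: 2, 4, 6, 8
```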



g. Min-Max Dispersion Measure Round Robin (MMRR)

Another variant of Round Robin is MMRR, which redefines the TQ repeatedly using the minimum and maximum values of the remaining burst times [15].



M=MAXBT-MINBT



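\[
TQ =
\begin{cases}
Y_{(N+1)/2} & \text{if } N \text{ is odd} \\
\frac{1}{2}\left[ Y_{N/2} + Y_{1+N/2} \right] & \text{if } N \text{ is even}
\end{cases}
\]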



where MAXBT is the maximum burst time and MINBT is the minimum burst time.



where Y is the value at the given position among the numbers listed in ascending order. This formula is then adjusted by a simple rule: the TQ cannot be less than 25. If the computed value X is lower (for example, TQ = 20), a TQ of 25 is used instead, to avoid situations that could lead to performance degradation.



\[
TQ =
\begin{cases}
X & \text{if } X \geq 25 \\
25 & \text{if } X < 25
\end{cases}
\]



This method was designed to maximize CPU utilization and throughput and to minimize turnaround, waiting and response time (although the last is already at its minimum value). Compared with simple RR it was quite effective at the time it was published, but with further development it came to be seen mainly as an object of study on which techniques such as DQRRR, MMRR and MDTQRR were later based.



The TQ then takes either the value M found from the above formula or a fixed value, according to the rule given below:



\[
TQ =
\begin{cases}
M & \text{if } M \geq 25 \\
25 & \text{if } M < 25
\end{cases}
\]



So it is an improvement of SARR. M is the range of the burst times.
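
A minimal sketch of the MMRR rule, assuming M is computed over the remaining burst times in the ready queue and the lower bound of 25 is in the same time unit as the burst times (the function and parameter names are illustrative):

```python
def mmrr_time_quantum(remaining_bursts, floor=25):
    """TQ = M if M >= 25, else 25, where M = MAXBT - MINBT."""
    m = max(remaining_bursts) - min(remaining_bursts)  # range of the burst times
    return m if m >= floor else floor

tq_wide = mmrr_time_quantum([30, 80, 45])    # M = 50, used as is
tq_narrow = mmrr_time_quantum([20, 30, 35])  # M = 15, raised to the floor of 25
```

The floor matters most when the burst times are close together, since the range M then becomes small regardless of how long the bursts are.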

Img 3. MMRR block diagram [15]



The block diagram above shows how this algorithm works; it also covers part of SARR, given that the only added function is that of calculating the median. The benefits of this algorithm are not very high in comparison with the other algorithms under study.

h. Even Odd Round Robin (EORR) 

The definition of a dynamic time quantum has produced another technique for determining the TQ. Two time quanta are determined according to whether a process occupies an odd or an even position [10]:

TQ1=AVG(Burst Time of Even Processes)
TQ2=AVG(Burst Time of Odd Processes)



Once both TQs are determined, the lower of the two values is taken. This keeps the burst time values close. This technique is not effective given that other methods like SRBRR are optimized. It was built with the aim of optimizing simple round robin but is worse in terms of system performance. This holds for all the determining parameters: it fails to minimize the average Waiting and Turnaround times and the number of Context Switches.
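
The EORR selection can be sketched as below. Two points are assumptions rather than statements from the cited work: positions are counted from 1 in ready-queue order, and the queue holds at least two processes. The function name is hypothetical.

```python
def eorr_time_quantum(burst_times):
    """Return the lower of TQ1 (even positions) and TQ2 (odd positions)."""
    odd = burst_times[0::2]    # positions 1, 3, 5, ... (1-based, assumed)
    even = burst_times[1::2]   # positions 2, 4, 6, ...
    tq1 = sum(even) / len(even)
    tq2 = sum(odd) / len(odd)
    return min(tq1, tq2)

tq = eorr_time_quantum([10, 40, 20, 60])  # TQ1 = 50, TQ2 = 15
```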

i. Multi-Dynamic Time Quantum Round Robin 
(MDTQRR) 

This method calculates the time quantum twice within a round robin cycle. The algorithm also takes into account the arrival time of each process in the ready queue. It has higher efficiency than RR and SRBRR, but only when the queue is sorted in ascending order. Compared with Improved-SRBRR it is very efficient at reducing the number of Context Switches, although the machine also has an impact here. The algorithm introduces two new concepts, MTQ (Median Time Quantum) and UTQ (Upper Quartile Time Quantum), which are calculated from the respective formulas. These two parameters are calculated once in each cycle. The formulas are given as follows:




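\[
MTQ =
\begin{cases}
Y_{(N+1)/2} & \text{if } N \text{ is odd} \\
\frac{1}{2}\left[ Y_{N/2} + Y_{1+N/2} \right] & \text{if } N \text{ is even}
\end{cases}
\]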



where Y is the value at the given position in the group of numbers sorted in ascending order, and N is the number of processes. To calculate UTQ the formula is [6]:

UQ=3/4(N+1) where N is the number of processes

CRITERIA= [{MTQ*m}+{UTQ*(N-m)}]/N 



In this form the highest percentage of processes end in the starting cycles. This is seen from the chart below:

Img 4. Chart for % of processes in each round (horizontal axis: rounds 1 to 6) [6]

The chart above shows that about 75% of the processes end in the first round and that the remaining 25% last no more than 6 rounds. This drastically decreases the number of context switches, but in many cases it also increases the overhead on the operating system. The number of context switches is given as follows:

Qt=[sum(Kr)]-1

where Qt is the number of Context Switches, r is the round number and Kr is the number of processes in round r. The algorithm's block diagram is given below [11]:

Img 5. MDTQRR block diagram [11]

The complexity of this algorithm is O(n), so it is within the allowed limits.

j. Improved Round Robin Based on Harmonic-Arithmetic Mean (HARM)

This algorithm presents a new method of determining the time quantum in each cycle, based on the arithmetic mean and the harmonic mean. One or the other formula is used, depending on which is judged more efficient in the given case. The formulas are given as follows:

\[
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
\]

This is the formula of the arithmetic mean, while the harmonic mean has a slightly more complex formula [15]:



\[
\bar{x}_{h} = n \left( \sum_{i=1}^{n} \frac{1}{x_i} \right)^{-1}
\]



This formula is not often given, as many people find it difficult, but it is quite efficient. There are three competing types of mean: harmonic, arithmetic and geometric. Each is used in the cases where it suits.

For a set of positive values in which at least two are not equal, the harmonic mean is the lowest of the three, the arithmetic mean is the highest and the geometric mean lies between them. The harmonic mean is therefore strongly affected by very low values, and the arithmetic mean by very high ones.

So if the burst time of a new process is much smaller than that of the preceding process, it is best to use the harmonic mean: the newly calculated mean stays close to the small value, which keeps the average waiting time low. If a process arrives whose burst time is much longer than that of its predecessor, it is better to use the arithmetic mean instead of the harmonic mean.

So if the burst times are heterogeneous, the harmonic mean has a large impact in reducing the average waiting time and the average turnaround time.
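
The choice between the two means can be sketched as below. The text speaks of a burst time that is "much smaller" or "much longer" than its predecessor without giving a threshold, so the plain comparison used here is an assumption, as is the function name `harm_time_quantum`. The sketch also computes all three means for one list to illustrate their ordering; `statistics.geometric_mean` requires Python 3.8 or later.

```python
from statistics import mean, harmonic_mean, geometric_mean

def harm_time_quantum(burst_times):
    """Use the harmonic mean if the newest burst is shorter than its predecessor,
    otherwise the arithmetic mean (the exact threshold is an assumption)."""
    if len(burst_times) >= 2 and burst_times[-1] < burst_times[-2]:
        return harmonic_mean(burst_times)
    return mean(burst_times)

bursts = [10, 40, 160]   # illustrative, strongly heterogeneous burst times
hm = harmonic_mean(bursts)
gm = geometric_mean(bursts)
am = mean(bursts)
```

For these values the harmonic mean is about 22.9, the geometric mean is 40 and the arithmetic mean is 70, so a single very long burst pulls the arithmetic mean up far more than it pulls the harmonic mean.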

V. MY APPROACH 

By carefully observing and analyzing all the techniques used and their results, I see room for a further development of one of the studied algorithms.

The main advantage of the HARM algorithm is that it minimizes the average waiting and turnaround times, but its use of the arithmetic mean seems questionable. If the burst time of the new process is much smaller, the harmonic mean is used for the reason analyzed previously. If the value is much higher, however, using the arithmetic mean may increase the waiting time, because the arithmetic mean is the largest of the three means. Instead we can use a form intermediate between the harmonic and arithmetic means, namely the geometric mean, whose formula is given below [15]:



\[
\bar{x}_{g} = \left( \prod_{i=1}^{n} x_i \right)^{1/n}
\]

In a way, this idea also depends slightly on the concept of Improved-SRBRR, and it can lead to a reduction of the average waiting time. The number of context switches is kept unchanged, since the only change is the value of TQ: where the arithmetic mean would have been used, the geometric mean is used instead, which can achieve higher efficiency in the said parameter. This theory, however, must still be tested.

VI. CONCLUSIONS 

It should be said that many of these techniques are highly sensitive to the order of the burst times in the ready queue. Many of the techniques are in fact built on first sorting and then running the algorithm. When a new process arrives, it is sorted into place by its burst time and the variables are redefined. Of the three orderings, the most efficient for these techniques is definitely the list of burst times sorted in ascending order.

In all the surveys made of the aforementioned techniques, the benchmark for comparison is the simplest algorithm, round-robin scheduling. Most importantly, this benchmark uses a static form of determining the time given to each process, and it is compared with the modified technique in its dynamic form. It should be noted that in all cases the dynamic technique is efficient, and in some cases the improvement is drastic, on the order of several times. The improvement in system performance shows in its three main parameters: the average waiting time, the number of context switches and the average turnaround time. So we can say that the dynamic form is strongly recommended over the static one.

Surveys show that some of these techniques are designed with one primary goal: minimizing the number of times the processor changes process, that is, the number of context switches. The technique with the highest efficiency in this parameter is Multi-Dynamic Time Quantum Round Robin (MDTQRR). Although it can lead to maximum CPU use, it aims to reduce the number of CS. Unlike in other techniques, here the arrival time is not set to zero but serves as an input.

The other two parameters, average waiting time and average turnaround time, are handled most efficiently by two other techniques, Improved Shortest Remaining Burst Round-Robin and HARM. Each performs very well in this respect, but taking the other parameter into account we could say that HARM is slightly better than the former.

In this area there seem to be continuous efforts at improvement and development, especially by students. Areas where problems remain are those pertinent to real-time and embedded systems. At the current rate, given that so many techniques have appeared in the last four years, more improvement seems likely to come from the work of universities.